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forward  stoichiometric  coefficients  for  species  s  and  reaction  r 
■ackward  stoichiometric  coefficients  for  species  .s  and  reaction  r 
coordinate  running  in  direction  of  increasing  i  index 
universal  constant  (=  3.1415927) 
mass  densitv  of  mixture 


:=1M 


Ps 

Pi+\/2 

a 

<^1  +  1/2 

tlt 

4> 


Xs 


mcLss  density  of  species  s 

arithmetical  averaged  density 

Stefan- Boltzmann  constant  (=  5.67032  x  10“*  ) 

intermediate  term  for  Harten-Yee  fluxes 

constants  for  k  —  t  turbulence  model 

stress  tensor 

Landau-Teller  time  constant  for  vibrational  relaxation 
parameter  in  extrapolation  of  U  to  surface  for  MUSCL  differencing 
{<f>  =  0:  first  order,  <f>  =  1:  second  order) 
partial  derivative  of  thermal  equation  of  state,  Xs  = 


dp» 


Mathematical  Symbols 


_j  summation  over  index  s  from  1  to  N 

.j  product  over  index  s  from  1  lo  N 

Subscripts 

i,  j,  k  average  value  of  variable  in  ceil  i,j,k 

Superscripts 


n 

n  1 

s 


value  of  variable  at  current  time  step  (time  step  where  solution 
is  known) 

value  of  variable  at  next  time  step  (time  step  where  solution  is 
being  sought) 
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1  Introduction 


Computational  fluid  dynamics  (CFD)  is  becoming  a  major  element  in  the  aerody¬ 
namic  design  and  analysis  of  full  aircraft  and  aircraft  components.  As  computers 
and  solution  algorithms  become  even  faster,  numerical  analysis  will  gradually  re¬ 
place  wind-tunnel  testing  in  most  design  procedures  in  the  aerospace  industry.  The 
emphasis  on  wind  tunnel  testing  will  shift  from  parametric  testing  of  candidate  de¬ 
sign  configurations  to  validation  of  CFD  numerical  analyses  and  verification  of  final 
designs.  CFD  analysis  can  potentially  generate  design  data  much  faster  and  at  sub¬ 
stantially  less  cost  than  by  using  wind-tunnel  testing.  CFD  also  offers  the  capability 
of  numerically  simulating  flow  fields  that  can  not  be  (or  are  extremely  difficult  to  be) 
achieved  in  wind  tunnels.  Since  hypervelocity  high-temperature  flows  are  very  diffi¬ 
cult  to  achieve  in  wind  tunnels,  CFD  analysis  is  of  critical  importance  in  the  design 
of  hypersonic  aircraft  such  as  the  National  Aerospace  Plane  (NASP). 

CFD  analyses  require  immense  computer  resources.  The  future  success  of  CFD 
analysis  depends  greatly  upon  the  development  of  faster  computers  and  better  solu¬ 
tion  algorithms.  Solutions  of  the  3D  Navier-Stokes  equations  (modeling  the  viscous 
flow  of  compressible  fluids)  run  for  hours  on  present  supercomputers  (such  as  the 
Cray-X-MP,  Cray-II,  and  the  Cyber  205)  for  the  analysis  of  flow  about  relatively 
simple  aircraft  components.  Peterson  [1]  projects  that  computers  capable  of  well 
over  one  teraflops  (one  trillion  floating-point  operations  per  second)  and  having  at 
least  one  billion  words  of  memory  will  be  necessary  for  accurately  simulating  whole 
aircraft  turbulent  flow  fields  using  current  algorithms.  The  estimates  are  even  more 
awesome  when  additional  complexities  are  introduced  into  the  equations  modeling  the 
flow  field:  multi-species  chemical  reactions,  multiple  phases,  sub-grid  scale  turbulence 
models,  and  direct  simulation  of  turbulence.  The  (theoretical)  technological  limit  of 
the  Single-lnstruction-Single-Data  (SISD)  and  the  Single-Instruction-Multiple-Data 
(SIMD)  computer  architectures  currently  used  on  most  supercomputers,  however,  is 
only  about  one  gigaflop  (one  billion  floating-point  operations  per  second)  [2].  New 
computer  architectures  must  be  utilized  to  achieve  the  substantial  increases  in  com¬ 
puting  speed  required  for  the  desired  CFD  applications. 

The  only  clear  path  to  achieving  substantial  increases  in  computing  speeds  is  the 
implementation  of  Multiple-Instruction-Multiple-Data  (MIMD)  computer  architec¬ 
tures  —  that  is  parallel  computers.  Numerous  first  generation  parallel  computers 
have  been  built  and/or  are  presently  available.  These  include  the  eight-cpu  Cr?y- 
Y-MP,  ILLIAC-IV,  Denelcor’s  HEP  computer.  Affiant’s  FX/8,  Intel’s  Touchstone 
(iPSC/860),  and  Silicon  Graphics  Iris.  Unfortunately,  the  current  algorithms  have 
been  developed  primarily  for  SISD  machines  with  some  use  of  vectorization.  These 
algorithms  will  not  be  able  to  utilize  the  full  capabilities  of  parallel  computers  effi¬ 
ciently. 

There  is  an  urgent  need  to  develop  parallel  algorithms  for  CFD  flow  analysis  codes 
to  take  advantage  of  parallel  computers,  but  development  of  parallel  algorithms  is  not 
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a  straight  forward  process.  One  of  the  problems  with  the  use  of  parallel  computers  is 
that  their  architectures  differ  in  many  respects.  The  number  of  CPU’s  can  vary  from 
2  to  tens  of  thousands  and  have  greatly  differing  computing  capability.  The  memory 
may  be  globally  shared  by  all  CPU’s  or  each  CPU  may  have  its  own  dedicated  memory. 
The  method  by  which  the  CPU’s  communicate  is  also  a  variable.  Each  CPU  may 
have  a  dedicated  path  to  all  of  the  memory,  a  switching  network  that  connects  the 
CPU’s  to  memory,  a  bus  that  provides  a  common  path  connecting  all  the  CPU’s  with 
the  memory,  or  other  methods  as  well.  The  above  architectural  aspects  of  parallel 
computers,  and  the  fact  that  these  architectures  are  changing  in  time,  make  the 
development  of  parallel  algorithms  difficult. 

The  goal  of  this  work  is  to  develop  a  flow  analysis  procedure  for  solving  the 
three-dimensional  Navier-Stokes  equations.  The  flow  analysis  procedure  is  capable 
of  simulating  three-dimensional  viscous  hypersonic  flows  over  complex  aerodynamic 
bodies,  including  the  effects  of  finite-rate  chemical  reactions.  Specific  applications 
could  include  hypersonic  vehicles  like  the  National  Aerospace  Plane  (NASP),  SDI 
interceptors,  as  well  as  other  conventional  aircraft  flow  fields.  The  flow  analysis 
procedure  utilizes  efficient  parallel  algorithms  for  efficient  computing  on  three  parallel 
computer:  a  CRAY  Y-MP  with  8  processors,  a  distributed  memory  Intel  iPSC860 
computer  with  128  processors,  and  a  Silicon  Graphics  Iris  4-D  with  8  processors. 

The  Navier-Stokes  equations,  with  additional  equations  for  thermochemical  nonequi¬ 
librium,  provide  an  accurate  mathematical  model  of  the  flow  of  gases  over  most  aero¬ 
dynamic  bodies  at  all  speeds  from  low  subsonic  to  hypersonic.  If  engineers  could 
solve  these  flow  equations  accurately  and  timely  (like  within  an  hour)  for  partial  or 
full  aircraft  configurations,  the  aircraft  design  process  would  be  revolutionized.  Engi¬ 
neers  could  rapidly  evaluate  candidate  configurations,  and  explore  radically  different 
designs  quickly  and  inexpensively.  If  numerical  flow  solutions  could  be  performed 
in  much  less  than  one  hour,  then  software  could  be  developed  to  perform  automated 
optimization  of  aircraft  aerodynamic  designs.  The  results  would  be  more  fuel-efficient 
and  higher- performance  aircraft  that  cost  less  to  design  as  compared  to  today’s  air¬ 
craft. 

During  the  Phase  I  work  several  candidate  parallel  algorithms  were  developed  and 
implemented  into  a  prototype  3D  Navier-Stokes  code.  The  algorithms  were  developed 
on  an  Encore  Multimax  computer  [3]  with  10  NS32332  32-bit  microprocessors,  each 
capable  of  executing  2  million  instructions  per  second  (mips),  yielding  a  total  of  20 
MIPS  (millions  of  instructions  per  second)  and  approximately  3  MFLOPS  (millions 
of  floating-point  operations  per  second)  for  the  Linpack  double  precision  benchmark. 
After  incorporating  the  algorithms  into  prototype  computer  codes,  the  codes  were 
applied  to  a  computationally  demanding  flow  field  calculation  on  the  Multimax. 

During  the  current  work  (Phase  II)  the  most  promising  parallel  algorithm  has 
been  refined  and  optimized.  The  algorithm  is  incorporated  into  a  production  code 
capable  of  simulating  hypersonic  flows  over  complete  aircraft  with  complex  geometry. 
The  final  result  of  Phases  II  and  III  will  be  a  useful  CFD  flow  analysis  code  that 


2 


efficiently  utilizes  the  parallel  processing  capability  of  MIMD  computers  with  large 
numbers  of  processors  and  can  be  applied  to  the  most  computationally  demanding 
flow  field  problems. 

This  report  is  broken  into  several  sections.  The  section  following  this  introduc¬ 
tion.  section  2,  discusses  the  specific  tasks  performed  during  this  contract.  Section 
3  discusses  the  model  code  developed  to  experiment  with  parallel  processing  before 
beginning  parallelizing  the  more  complex  Navier-Stokes  code.  Section  4  discusses  the 
governing  equations  and  boundary  condition  solved  by  the  Navier-Stokes  code  and 
section  5  discusses  the  solution  procedure  used  by  the  Navier-Stokes  code.  Section 
6  discusses  the  parallelization  of  the  Navier-Stokes  code  on  three  parallel  computers. 
Section  7  discusses  the  the  suite  of  test  cases  chosen  for  this  contract.  Section  8  dis¬ 
cusses  the  analysis  of  the  parallel  performance  of  the  Navier-Stokes  code  and,  finally, 
section  9  gives  conclusions. 
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2  Phase  II  Tasks 


This  section  contains  a  brief  description  of  the  tasks  completed  during  this  contract. 
Details  of  the  research  and  results  are  discussed  in  following  sections. 

TASK  1.  Develop  a  Parallel  Implicit  Algorithm  for  the  Model  Equa¬ 
tions  for  the  Silicon  Graphics  Iris. 

A  model  code  was  developed  that  simulates  the  operation  count,  memory  ac¬ 
cesses,  memory  storage  requirements,  and  the  ratio  of  serial/ parallel  code  of  the 
Navier-Stokes  equations.  Since  the  model  code  uses  the  parallel  implicit  algorithm 
to  solve  a  simple  set  of  Laplace  equations,  instead  of  the  much  more  complicated 
Navier-Stokes  equations,  the  development  time  is  much  shorter  and  modifying  the 
algorithm  is  much  more  convenient.  The  model  code  allowed  us  to  study  the  behav¬ 
ior  of  the  parallel  implicit  algorithm  on  the  Silicon  Graphics  Iris  4-D  such  as  data 
communication  overhead,  optimum  subzone  size,  and  potential  speedup.  Using  this 
model  code  the  parallel  implicit  algorithm  was  optimized  for  execution  speed.  The 
model  code  was  written  in  Fortran  77.  See  section  3  for  discussion  and  results. 

Task  2,  Develop  a  Parallel  Implicit  Algorithm  for  the  Model  Equations 
for  a  Distributed-Memory  MIMD  Computer. 

As  in  TASK  1  a  model  code  was  developed  to  study  the  behavior  of  the  parallel  im¬ 
plicit  algorithm  on  a  distributed-memory  MIMD  computer.  The  distributed-memory 
MIMD  computers  available  at  the  beginning  of  this  contract  were  examined  and  the 
Intel  Touchstone  iPSC860  computer  system  was  chosen.  Time  was  obtained  on  the 
128  processor  iPSC860  at  NASA’s  Numerical  Aerodynamic  Simulator  (NAS)  to  per¬ 
form  the  programming  of  this  and  latter  tasks. 

Task  3.  Develop  a  Parallel  Implicit  Algorithm  for  the  Model  Equations 
for  the  Cray  Y-MP  Computer. 

As  in  TASK  1  a  model  code  was  developed  to  study  the  behavior  of  the  parallel 
implicit  algorithm  on  a  Cray  Y-MP  supercomputer.  The  code  was  parallelized  using 
macrotasking.  The  Cray  Y-MP  at  NASA’s  Numerical  Aerodynamic  Simulator  (NAS) 
at  Moffett  Field,  California  was  used.  An  important  part  of  the  effort  was  to  compare 
the  cost/performance  of  various  MIMD  computers  with  current  supercomputers. 

Task  4.  Evaluate  Ada,  C,  and  Mixed  Fortran77/C  Languages. 

The  use  of  Ada,  C  and  mixing  Fortran77  with  C,  instead  of  Fortran77  was  eval¬ 
uated.  It  was  decided  that  the  alternative  languages  provided  no  benefit  significant 
enough  to  warrant  reprogramming  the  existing  production  Navier-Stokes  program. 
Fortran  77  was  used  for  all  of  the  remaining  tasks. 

Task  5.  Implement  a  Parallel  Implicit  Algorithm  into  the  Zonal  Navier- 
Stokes  code  for  the  Silicon  Graphics  Iris. 

In  the  first  four  tasks  we  learned  how  the  parallel  implicit  algorithm  should  be  im¬ 
plemented  into  the  full  zonal  Navier-Stokes  code.  With  this  information,  in  this  task 
the  coding  was  performed  for  execution  on  the  Silicon  Graphics  Iris  4-D  computers. 

The  governing  equations  and  solver  for  the  Navier-Stokes  code  are  described  in 
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section  4  and  5,  respectively.  This  code  is  also  used  in  the  following  five  tasks. 

Task  6.  Calculate  Flow  Problems  on  the  Silicon  Graphics  Iris. 

In  this  task  a  suite  of  six  to  ten  real-life  engineering  calculations  was  performed 
on  the  Silicon  Graphics  Iris.  These  cases  include  simple  and  complex  two  and  three- 
dimensional  viscous  flows.  Comparisons  where  made  to  experimental  data.  The 
parallel  efficiency  of  the  algorithm  was  measured  over  a  wide  range  of  processors. 
These  problems  include  subsonic,  transonic,  and  supersonic  flows,  over  a  range  of 
Reynold’s  number,  and  with  various  types  of  boundary  conditions.  The  choice  of  test 
cases  is  described  in  section  7. 

Task  7.  Implement  a  Parallel  Implicit  Algorithm  into  the  Zonal  Navier- 
Stokes  code  for  a  Distributed-Memory  MIMD  Computer. 

The  parallel  implicit  algorithm  was  implemented  in  the  zonal  Navier-Stokes  code 
for  execution  on  a  distributed-memory  MIMD  computer.  The  distributed-memory 
MIMD  computers  available  at  the  beginning  of  this  contract  were  examined  and 
the  Intel  Touchstone  iPSCSGO  computer  system  was  chosen.  Time  was  obtained  on 
the  128  processor  iPSCSGO  at  NASA’s  Numerical  Aerodynamic  Simulator  (NAS)  to 
perform  the  programming  of  this  task  and  the  calculations  of  Task  8. 

Task  8.  Calculate  Flow  Problems  on  the  Distributed-Memory  MIMD 
Computer. 

As  in  Task  6,  we  ran  the  same  suite  of  real-life  engineering  flow  problems  to 
evaluate  the  performance  of  the  zonal  Navier-Stokes  code  on  the  distributed-memory 
MIMD  computer. 

Task  9.  Implement  a  Parallel  Implicit  Algorithms  into  the  Zonal  Navier- 
Stokes  code  for  the  Cray  Y-MP. 

In  order  to  get  comparisons  with  current  supercomputer  speeds,  we  implemented 
the  parallel  implicit  algorithm  into  the  zonal  Navier-Stokes  code  for  execution  on  the 
CRAY  Y-MP. 

Task  10.  Calculate  Flow  Problems  on  the  Cray  Y-MP  Computer. 

The  suite  of  real-life  engineering  flow  problems  was  run  on  the  CRAY  Y-MP  with 
the  number  of  processors  varying  from  1  to  8. 

Task  11.  Evaluate  and  Compare  Results. 

In  this  tcisk  we  evaluated  and  compared  the  results  from  the  previous  tasks.  In 
particular,  the  parallel  speedup  and  efficiency  of  selected  flow  problem  on  each  com¬ 
puter  was  compared.  Trends  were  identified.  Effects  of  flow  conditions,  computer 
architecture,  algorithm  characteristics,  and  any  other  significant  items  were  analyzed 
and  documented  in  section  8. 

Task  12.  Deliver  Software  to  DARPA. 

The  zonal  Navier-Stokes  software  that  is  developed  in  this  contract  will  have  use 
by  others  in  the  DARPA  research  conununity  for  application  to  flow  field  compu¬ 
tations.  There  may  also  be  interest  in  the  DARPA  research  community  to  analyze 
and  study  the  actual  coding  in  the  actual  coding  in  the  zonal  Navier-Stokes  code  in 
order  to  better  understand  the  parallel  implicit  algorithm.  The  use  of  the  code  are 
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be  documented  in  a  user’s  manual.  The  software  is  being  delivered  to  DARPA. 

Task  13.  Reporting,  Documentation,  and  Reviews. 

Results  of  this  work  were  documented  in  semi-annual  technical  reports  and  in  this 
final  technical  report.  R&D  status  reports  and  funds  expenditure  charts  were  deliv¬ 
ered  monthly.  In  order  to  enhance  the  communication  between  DARPA  and  Amtec, 
oral  reviews  were  made  periodically  at  the  DARPA  office  in  Arlington,  Virginia. 
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3  Model  Code 


A  model  code  was  developed  to  simulate  the  operation  count,  memory  accesses,  mem¬ 
ory  storage  requirements,  and  the  ratio  of  serial  to  parallel  code  in  the  Navier-Stokes 
code.  This  development  was  under  tasks  1  through  4  as  described  in  section  2.  Since 
the  model  code  uses  a  parallel  implicit  algorithm  to  solve  a  simple  set  of  Laplace 
equations,  instead  of  the  much  more  complicated  Navier-Stokes  equations,  the  devel¬ 
opment  time  is  much  shorter  and  modifying  the  algorithm  is  much  more  convenient. 
The  model  code  is  written  to  match  the  structure  of  the  Navier-Stokes  code  as  closely 
as  possible. 

The  model  code  solves  the  Laplace’s  equation  (V^<^  =  0)  on  a  three-dimensional 
multiple-zone  Cartesian  grid.  The  internal  points  are  differenced  using  standard  sec¬ 
ond  differences 


Ax^ 


(1) 
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Dirichlet  boundary  conditions  are  used  with  4>  set  to  unity  on  some  of  the  faces  of  the 
zones  and  zero  on  the  other  faces.  The  result,  when  equation  1  is  applied  to  all  cells 
in  the  grid,  is  a  large  sparse  linear  system  of  algebraic  equations.  These  equations 
are  solved  using  point  Gauss-Seidel  relaxation.  This  relaxation  scheme  is  modified 
to  be  a  lower-upper  (LU)  approximate  factorization  as  is  done  for  the  Navier-Stokes 
solver  (see  section  5).  The  procedure  uses  a  diagonal  wavefront  algorithm  with  which 
the  inner  most  loop  is  over  all  points  on  the  i  j  +  k  constant  plane.  All  points  on 
this  plane  are  independent  of  the  other  points  on  the  plane  so  that  they  may  all  be 
updated  simultaneously.  This  allows  the  code  to  vectorize  (or  parallelize)  over  the 
inner  loop.  The  Navier-Stokes  solver  also  uses  a  diagonal  wavefront  algorithm. 

The  model  code  and  the  Navier-Stokes  code  are  multiple-zone  codes  which  use 
multiple  ij,  k  ordered  grids  patched  together  at  common  boundaries.  The  hierarchi¬ 
cal  structure  of  the  codes  reflect  their  multiple-zone  nature.  The  main  routine  is  a 
driver  routine  which  calls  subroutines  to  operate  on  zones.  The  main  operations  are 
"perform  one  iteration  for  a  zone”  and  “transfer  data  between  zones”.  This  model 
fits  nicely  with  the  data  partitioning  and  interprocessor  communication  approach 
required  on  parallel  computers. 

Development  of  the  model  code  was  completed  on  the  Cray  Y-MP,  the  Intel  Touch¬ 
stone  iPSC/860  and  the  Silicon  Graphics  Iris  4-D  machines.  The  following  subsec¬ 
tions  present  results  and  specific  implementation  of  the  domain  decomposition  due  to 
different  computer  architectures  on  the  distributed  memory  MIMD  computer  (Intel 
Touchstone  iPSC/860),  the  Cray  Y-MP,  and  the  SGI  machine. 
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Figure  1:  Domain  Decomposition  on  Distributed  Memory  MIMD  Computer 

3.1  Distributed  Memory  MIMD  Computer 

The  code  was  parallelized  on  the  distributed  memory  computer  using  a  permanent 
domain  decomposition  technique.  The  basic  idea  of  domain  decomposition  is  to  break 
up  the  overall  domain  into  subdomains  (subzones)  which  are  assigned  to  separate  pro¬ 
cessors.  In  this  case,  the  number  of  subzones  is  equal  to  the  number  of  processors 
used  in  the  calculations.  A  schematic  diagram  of  the  domain  decomposition  of  a 
simple  two-dimensional  geometry  is  shown  in  Figure  1.  The  size  of  the  subzones  must 
be  small  enough  to  fit  within  the  available  local  memory  of  the  processor  attached 
to  that  subzone.  Calculations  are  driven  using  a  global  time  step.  Subzones  are  con¬ 
nected  by  means  of  interzone  patches  which  provide  communication  between  adjacent 
subzones.  This  communication  between  the  subzones  on  different  nodes  is  done  by 
copying  the  data  from  the  sending  subzone  to  em  intermediate  array  on  the  same 
processor.  This  array  is  then  sent  to  an  intermediate  array  on  the  receiving  processor 
and  subsequently  used  as  boundary  conditions  in  the  corresponding  subzone  on  the 
receiving  processor. 

However,  this  domain  decomposition  technique  changes  the  relaxation  procedure 
somewhat  because  data  is  lagged  at  the  boundaries  of  the  zones.  We  studied  the  effect 
of  this  modification  on  the  convergence  rate  by  running  a  40  x  40  x  40  calculation 
using  1  through  25  processors  on  the  iPSC/860  computer  (i.e.,  1  to  25  zones).  Figure 
2  shows  the  results  of  the  model  code  with  number  of  zones  varying  from  1  to  25. 
Figure  .3  shows  the  corresponding  convergence  rates.  It  is  seen  that  subdivision  of 
the  zones  reduces  the  convergence  rate  slightly  as  the  number  of  zones  increases. 
However,  the  effect  seems  to  be  small  and  therefore,  can  be  ignored  especially  when 
the  gain  in  speed  up  is  significant. 
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Processors 

Time  (seconds) 

Speed-Up 

Efficiency 

MFlops 

1 

1763.5 

1.00 

100 

4 

440.1 

4.00 

100 

9 

196.0 

9.00 

100 

16 

110.5 

15.96 

100 

25 

73.6 

23.96 

96 

40.75 

32 

59.6 

29.59 

92 

50.32 

36 

53.0 

33.27 

92 

56.59 

49 

41.8 

42.19 

86 

71.75 

64 

32.9 

53.60 

84 

91.16 

Table  1:  Results  of  Model  Code  Calculations  on  Intel  iPSC/860  MIMD  Computer. 

The  Laplace  code  was  run  numerous  times  to  obtain  timings  on  the  iPSC/860 
parallel  computer.  The  results,  which  were  obtained  using  the  vector  option  on  the 
Intel  Portland  Group  compiler  (if77),  are  shown  in  Table  1.  Note  that  the  efficiencies 
are  very  high  for  this  problem.  The  MFlops  were  calculated  from  the  measured  time 
on  the  iPSC/860  and  the  number  of  floating  point  operations  measured  by  the  Cray 
hardware  performance  monitor  for  the  Cray  Y/MP  version  of  the  model  code.  The 
main  contributors  to  the  reduction  in  efficiency  with  increasing  number  of  processors 
are  sequential  code,  load  leveling,  and  interprocessor  communication  overhead. 

Sequential  code  are  sections  of  the  computer  program  that  cannot  be  parallelized. 
The  time  required  to  execute  the  sequential  portion  of  the  code  does  not  decrease 
with  increasing  number  of  processors  and  therefore  becomes  increasingly  significant 
as  the  total  run  time  decreases  with  the  addition  of  more  processors.  Even  though  the 
algorithm  is  designed  to  be  perfectly'  parallel  (i.e.,  no  sequential  portion  whatsoever), 
there  are  redundant  initializations  which  are  performed  by  each  processor  and  thus 
can  be  thought  of  as  sequential  code.  Unfortunately,  the  overall  amount  of  work 
performed  by  the  Laplace  code  is  small  and  hence  the  work  performed  on  this  very 
small  portion  of  sequential  code  is  a  significant  percentage  of  the  total  work. 

The  second  contribution  to  the  reduction  in  efficiency  is  poor  load  leveling.  Load 
leveling  is  making  the  amount  of  work  performed  by  all  processors  equal.  If  the 
amount  of  work  in  not  equal  then  some  processors  will  waste  CPU  time  waiting 
for  other  processors  to  finish,  and  hence  the  overall  performance  is  dictated  by  the 
slowest  processor  (or  processor  with  the  most  work).  The  model  code  automatically 
subdivides  the  grid  into  relatively  equal  subzones.  If  the  subzones  are  all  of  the 
same  size,  the  amount  of  work  performed  by  the  processors  is  nearly  equal  and  the 
load  is  leveled.  Unfortunately,  the  solution  domain  cannot  always  be  subdivided  into 
equally  sized  subzones.  To  elucidate  this  effect,  we  have  performed  calculations  on 
four  different  grid  sizes.  The  original  speed-up  and  efficiencies  are  shown  in  Figures 
4  and  5. 

The  lack  of  load  leveling  can  be  factored  out  to  obtain  the  potential  speed-up  and 
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Efficiency  ^  Speed-Up 


Figure  4;  Speed-up  with  Load  Balancing  Problem 


Figure  5:  Efficiency  with  Load  Balancing  Problem 
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Figure  6;  Speed-up  for  Perfect  Load  Leveling 


efficiency  when  the  work  load  of  each  processors  are  equal, 

_  .  average  zone  size 

efficiency  = - : - : — . 

maximum  zone  size 

The  adjusted  speedup  and  efficiency,  with  the  effect  of  load  leveling  factored  out,  is 
shown  in  Figures  6  and  7.  The  adjusted  efficiency  with  64  processors  varies  from  48% 
for  a  48x48x5  grid  to  62%  for  a  72x72x5  grid.  The  remaining  reduction  in  efficiency 
is  due  to  sequential  code  and  interprocess  communication  overhead.  Note  that  for 
smaller  problems,  the  overall  parallel  performance  is  shown  to  be  degraded.  Therefore, 
to  maximize  a  parallel  performance,  one  must  run  the  largest  problem  possible  on 
each  of  the  processor.  For  small  problems,  use  a  small  number  of  processors  and  for 
large  problems  use  a  .arger  number  of  processors.  However,  if  waill-clock-time  is  the 
primary  interest,  the  more  processors  used  in  the  calculation  the  shorter  the  time. 
For  the  problems  considered,  there  was  never  a  case  where  increasing  the  number  of 
processors  increased  the  run  time. 

The  final  contribution  to  the  reduction  in  efficiency  is  interprocessor  cornmuni- 
cation  overhead.  Interprocessor  communication  adds  an  overhead  which  is  highly 
dependent  on  the  parallel  computer  being  used.  For  the  iPSC/860  the  communi¬ 
cation  rate  between  processors  is  very  fast  so  the  main  overhead  is  in  establishing 
the  communication  link.  The  iPSC/860  therefore  favors  the  passing  of  a  few  large 
sections  of  data  rather  than  many  small  sections  of  data.  For  a  given  size  problem, 
the  interprocessor  communication  overhead  increases  with  the  number  of  processors 
for  two  reasons:  the  total  amount  of  data  transferred  increases  and  the  number  of 
interzone  communication  links  increases  (more  communication  startup  overhead).  In¬ 
terprocessor  communication  overhead  is  believed  to  be  the  biggest  contributor  to  the 
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Figure  7:  Efficiency  for  Perfect  Load  Leveling 
reduction  in  parallel  efficiency  in  Figures  6  and  and  7. 

3.2  Shared  Memory  Cray  Y-MP 

The  procedure  for  parallelizing  on  the  Cray  Y-MP  is  similar  to  the  procedure  used  for 
the  distributed  memory  MIMD  computer.  Both  use  domain  decomposition,  where 
the  overall  domain  (original  zones)  is  broken  up  into  subdomains  (new  zones)  which 
are  assigned  to  separate  processors.  This  subdivision  of  zones  changes  the  relaxation 
procedure  because  data  is  lagged  at  the  boundaries  of  the  zones.  The  effect  of  these 
lagged  boundaries  on  the  convergence  rate  v/as  shown  to  be  small  for  the  iPSC/860 
version,  and  since  the  Cray  Y-MP  has  only  8  processors  this  effect  is  negligible.  The 
parallel  performance  was  measured  for  macrotasked  runs  with  1  to  8  processors.  It 
is  important  to  note  that  these  runs  were  performed  in  the  dedicated  (one  user) 
mode,  since  there  are  no  tools  to  measure  the  performance  of  macrotasked  jobs  in  the 
multiuser  mode.  The  resulting  speedup  and  parallel  efficiency  are  shown  in  Figure  8. 
The  efficiency  varies  from  100%  for  one  cpu  to  23%  for  8  cpu’s.  This  is  a  very  rapid 
drop  in  efficiency  compared  to  the  results  for  the  Intel  Touchs  one  computer.  In  fact, 
the  speedup  in  Figure  8  shows  that  it  is  actually  slower  to  run  with  6  or  8  processors 
than  to  run  with  4  processors. 

It  was  concluded  that  the  poor  parallel  performance  of  the  model  code  on  the 
C ray  Y-MP  is  due  to  a  reduction  in  vector  length  with  zonal  subdivision.  The  Cray 
Y-MP  cpu's  are  highly  dependent  on  vectorization  to  obtain  their  good  performance. 
The  domain  decomposition  used  to  parallelize  the  code  reduces  the  vector  lengths. 
To  test  this  hypothesis  we  ran  the  model  code  on  one  cpu  with  the  grid  divided  into 
subzones  as  it  is  when  run  in  parallel.  The  speedup  and  efficiency  are  then  calculated 
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Figure  S:  Parallel  performance  of  model  code  on  the  Cray  Y-MP  -  calculated  by 
comparing  to  one  zone  on  one  cpu 

by  comparing  to  the  single  cpu  results  with  an  equivalent  number  of  subzones.  The 
results  are  shown  in  Figure  9,  where  p  is  the  number  of  subzones.  Using  this  compar¬ 
ison,  the  parallel  efficiency  varies  between  98%  and  126%.  Clearly,  the  reduction  of 
vector  length  due  to  the  division  of  the  solution  domain  into  subzones  is  the  primary 
cause  of  the  low  performance  in  Figure  8,  not  the  fact  that  multiple  cpu’s  are  used. 

3.3  Shared  Memory  Silicon  Graphics  Iris 

Most  of  the  subroutines  developed  for  Cray  Y-MP  «ind  distributed  memory  MIMD 
computers  were  used  to  develop  a  parallel  version  of  the  model  equations  on  the  Silicon 
Graphics  Iris  4-D  with  8  cpu's.  The  SGI  is  a  shared  memory  MIMD  computer.  The 
parallelization  is  done  by  generating  threads  that  run  on  the  different  processors. 
The  DO  loop  is  the  basic  work  unit  which  is  split  into  concurrent  threads.  Since 
the  compiler  on  this  machine  is  limited  only  to  DO  loop  parallelization,  the  original 
thread  is  the  master,  and  it  controls  the  execution  of  all  other  threads.  By  splitting 
the  DO  loops  surrounding  the  A-level  in  the  driver  routine,  the  code  is  executed  in 
parallel.  The  CSDOACROSS  compiler  directive  is  used  for  the  DO  loop.  The  subzone 
number  is  defined  as  the  LASTLOCAL  variable  where  the  final  value  is  important. 
However,  the  iteration  number,  time  step,  and  convergence  rate  are  defined  to  be 
shared.  Data  for  each  subzone  is  isolated  and  can  only  be  accessed  by  one  thread, 
thus  preventing  memory  access  collisions. 

Based  on  the  i—  and  j— direction  divisions  specified  by  the  user,  each  zone  is 
divided  automatically  into  subzones.  One  of  the  processors  must  perform  the  job 
management  and  i/o  in  addition  to  number  crunching.  This  adds  work  to  one  pro- 


14 


PamlM  Pwtacmano*  <MM<iin2) 


Raqulr^  Cpu 


Figure  9:  Parallel  performance  of  model  code  on  the  Cray  Y-MP  -  calculated  by 
comparing  to  p  zones  on  one  cpu 

cessor  and  could  effect  the  load  leveling  for  small  problems,  but  it  is  insignificant  for 
most  problems.  The  interzone  communication  is  done  by  copying  boundary  condition 
data  into  an  intermediate  array.  Data  in  this  array  is  locked  until  all  processors  finish 
the  time  step.  This  synchronizes  the  calculation  so  that  no  processor  may  start  on 
the  next  time  step  until  all  processors  are  finished  with  the  current  time  step.  If  the 
subdivision  of  the  zones  hcis  a  load  balancing  problem,  or  if  other  processes  delay 
the  execution  of  one  thread,  this  synchronization  will  significantly  reduce  the  parallel 
efficiency.  However,  synchronization  is  necessary  to  prevent  memory  collision  which 
could  easily  happen  in  a  shared  memory  system.  After  the  time  step  is  finished  in 
all  subzones,  data  in  the  intermediate  arrays  are  available  to  complete  the  transfer  of 
data  between  the  subzones. 

The  model  code  was  run  numerous  times  to  obtain  timings  on  the  parallel  Silicon 
graphics  computer.  The  solution  was  the  same  as  that  obtained  on  the  iPSC/860, 
Figure  2,  The  parallel  speed  up  and  efficiency  results  are  shown  in  Figures  10.  The 
test  case  consists  of  40  x  40  x  40  mesh  points,  similar  to  the  one  used  for  timings  on  the 
Cray  Y-MP  ajid  Intel  Touchstone  iPSC/860  machines.  It  is  important  to  note  that 
the  runs  were  performed  in  a  multi-user  mode.  Since  the  maximum  number  of  cpus 
on  the  SGI  machine  is  8,  the  effects  of  the  domain  decomposition  on  the  convergence 
rate  is  negligible.  Unlike  the  Cray  Y-MP,  however,  the  reduction  in  vector  length  from 
the  subdomain  decomposition  does  not  significantly  affect  the  overall  performance  of 
the  code.  The  parallel  speed  up  and  parallel  efficiency  are  excellent  for  up  to  4  cpu‘s. 
The  significant  drop  in  efficiency  of  the  code  for  more  than  4  cpus  is  due  to  memory 
contention,  bus  bandwidth  limitations,  and  cpu-time  competition  from  other  users  on 
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4  Navier- Stokes  Equations  with  Chemistry 

4.1  Transport  Equations 

Tasks  5  through  11  of  this  contract  (see  section  2)  were  performed  using  Amtec’s  ex¬ 
isting  three-dimensional  Navier-Stokes  code.  This  code  solves  the  Reynolds-averaged 
Navier-Stokes  equations  with  user  specified  chemistry,  thermal  nonequilibrium,  and 
a  two-equation  turbulence  model.  These  equations  are  given  below  in  integral  form 

[4]. 

where 

P  =  Fit  +  F23  +  Fsk 

and 


U 
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The  forward  reaction  rate,  kj  ,.,  is  calculated  from  the  Arrhenius  equation  using  user 
supplied  coefficients.  The  equilibrium  constant,  Kc,ri  is  calculated  from  user  supplied 
curve  fits  to  the  species  enthalpy  and  entropy. 

The  vibrational-electronic  energy  source  term  is  given  by  the  Landau- Teller  for¬ 
mula, 


t'lt 


(3) 


where  the  time  constant  is  given  by  the  Landau-Teller  expression, 


6,r'^exp  {{b,/Tf*) 
pil  -bse-^^l^) 


(4) 


and  the  equilibrium  vibrational  energy,  is  evaluated  in  terms  of  the  translation- 
rotational  temperature,  T,  using  the  curve  fits  described  in  section  4.2.  Equations 
3  and  4  are  a  simple  relaxation  of  the  mixture  vibration-electronic  energy  toward 
equilibrium. 

For  turbulent  flows  there  are  two  options  for  turbulence  models:  the  Baldwin- 
Lomax  algebraic  model  [5]  and  the  k  —  e  two-equation  model  [6].  When  the  k  —  e 
turbulence  model  is  chosen,  the  Icist  two  transport  equations  in  equations  2  are  solved. 
The  pertinent  constants  and  relations  for  this  model  are  as  follows. 
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The  values  of  Cj, ,  Cej,  <7^,  <t<,  and  C^^  may  be  specified  by  the  user.  The  above  values 
are  the  defaults. 

Physically  Equation  1  represents  a  very  simple  idea:  the  time  rate  of  change  of 
mass,  momentum,  and  energy  within  an  arbitrarily  chosen  volume,  V,  is  equal  to 
the  apparent  flux  of  these  quantities  inward  through  the  surface,  5,  surrounding  the 
volume  plus  the  production  of  these  quantities  within  the  volume.  The  finite  volume 
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Figure  11:  Finite  Volume  Mesh 


method  consists  of  breaking  the  flow  field  up  into  a  large  number  of  finite  volume  cells, 
ais  shown  in  Figure  11,  and  applying  the  integral  equations  directly  to  each  volume. 


4.2  Thermodynamics 

The  equations  in  the  previous  section  must  be  supplemented  by  an  equation  of  state 
to  calculate  p  and  T  from  p  and  e.  The  code  offers  three  options  for  equations  of 
state:  a  single  species  perfect  gas,  Tannehill’s  curve  fits  for  real  air,  or  a  mixture  of 
thermally  perfect  gases.  The  latter  option  must  be  used  if  multiple  species  or  thermal 
nonequilibrium  is  considered. 

In  addition  to  calculating  p  and  T,  it  is  necessary  to  calculate  five  thermodynamic 
variables  which  are  needed  to  calculate  the  numerical  fluxes.  The  first  two  are  7  and 
7.  For  perfect  gas  flows  7  and  7  are  identical  and  are  equal  to  the  ratio  of  specific 
heats.  7  =  ^.  For  real  gases,  however,  the  ratio  of  specific  heats  no  longer  has  any 
physical  significance  and  the  quantities  7  and  7  are  defined  so  as  to  simplify  the 
calculation  of  numerical  fluxes.  In  particular  we  use  Vinokur’s  definition  [7]. 

p<? 

P 

p{e  -  eo) 

For  numerical  fluxes  based  on  Roe’s  flux  function  Vinokur  has  defined  the  alternative 
variables  k  and  y  which  are  the  partial  derivatives  of  pressure  p  with  respect  to 


7  = 

7  = 


internal  energy  per  unit  volume  e  and  density  p. 


?E. 

de 

dp 


The  final  state  variable  is  the  mean  specific  heat  Cu  = 


4.2.1  Perfect  Gas 

The  perfect  gas  equation  of  state  is  expressed  with  the  following  equations: 

p  =  p(7-l)e  (5) 

T  =  -  (6) 

Cy 

The  equation  of  state  can  be  altered  for  different  gases  by  specifying  the  ratio  of  spe¬ 
cific  heats,  7,  and  the  gas  constant,  R.  The  gas  constant  for  a  gas  of  a  given  molecular 
weight.  A/,  is  calculated  from  the  universal  gas  constant  R  =  8.31434J/mole  K  using 
the  formula  R  =  R/M.  The  specific  heat,  Cv  is  calculated  using  Cu  = 

The  remaining  five  needed  state  variables  are  given  by  the  following  formulas. 

7  =  specified 

7  =  7 

K  =  7—1 

X  =  0 

R 

=  Cv  = - r 

7-1 


4.2.2  Equilibrium  Air  Curve  Fits 

The  second  option  for  equation  of  state  is  Tannehill’s  curve  fits  for  real  air.  With  this 
option  the  code  calls  a  subroutine  given  in  reference  [8]  to  calculate  p  and  T  from 
p  and  e  for  each  finite-volume  cell  and  each  time  step.  This  routine  contains  curve 
fits  for  p,  T,  and  c  in  terms  of  p  and  e  obtained  from  detailed  calculations  of  air  in 
thermochemical  equilibrium.  The  curves  are  valid  up  to  25,000K. 

The  remaining  five  needed  state  variables  must  be  calculated  from  the  outputs 
of  the  equilibrium  air  subroutine.  In  addition  to  pressure  p  and  temperature  T,  the 
subroutine  also  returns  the  speed  of  sound,  c.  With  this,  amd  the  fact  that  eo  =  0  we 
can  calculate  three  variables  directly  from  their  definitions. 
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The  pressure  derivatives,  k  and  x  cannot  be  directly  calculated  using  data  returned 
from  the  subroutine.  Some  numerical  differentiation  is  required. 

To  evaluate  k  and  Xi  start  with  the  chciin  rule 


6p  = 


Sp  =  nSe  +  x^P 


and  the  expression  for  speed  of  sound  in  terms  of  k  and  x  (equation  5  in  [7]) 


c^  =  X  + 


Eliminating  x  from  the  two  equations  and  solving  for  /c  yields 

Sp  —  c^Sp 
Se  —  hSp 

_  p(p  +  Sp,e  +  Se)  —  p(p,  e)  -  c^Sp 
Se  —  hSp 

This  expression  may  be  used  to  calculate  k  from  any  combination  of  Sp  and  Se.  After 
some  experimentation,  Se  =  2hSp  was  chosen  since  it  seems  to  avoid  most  spurious 
results  near  discontinuities  in  the  curve  fits.  The  other  pressure  derivative,  Xt  is  then 
calculated  from  x  =  c^  —  nh. 


4.2,3  Thermally  Perfect  Species  -  Thermal  Equilibrium 

The  third  option  for  equation  of  state  is  for  thermally  perfect  species  in  either  thermal 
equilibrium  or  thermal  non-equilibrium.  This  option  requires  curve  fits  in  terms  of 
temperature  for  the  specific  heat  at  constant  pressure,  Cp,  the  enthalpy,  /i,  and  the 
entropy,  s,  for  each  species.  These  curve  fits  are  expressed  as  polynomials  for  up  to 
three  user  specified  temperature  ranges. 
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The  temperature  ranges  chosen  for  most  of  the  cases  considered  in  this  research  effort 
are  300K  to  lOOOK,  lOOOK  to  6000K,  and  6000K  to  15000K.  If  the  temperature 
exceeds  the  maximum  temperature  of  the  upper  range  or  the  minimum  temperature 
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of  the  lower  range,  the  specific  heat  is  assumed  to  be  constant  at  the  value  calculated 
for  the  closest  temperature  within  the  valid  temperature  range.  The  out  of  range 
enthalpy  and  entropy  curves  are  then  calculated  by  appropriate  integrations  of  Cp, 
and  matching  of  values  at  the  edge  of  the  valid  temperature  range. 

For  thermal  equilibrium  an  iteration  is  necessary  to  obtain  the  temperature,  T, 
from  the  species  densities,  p,,  and  internal  energy,  e,  using  the  above  curve  fits.  The 
formula  for  internal  energy  in  terms  of  temperature  for  a  given  set  of  species  densities 
is  obtained  from  the  definition  of  enthalpy,  .r  =  e  +  p/p,  the  thermal  equation  of 
state,  p,  =  jfRT,  and  the  the  curve  fit  for  enthalpy  from  equation  7.  The  resulting 
expression, 


is  a  nonlinear  equation  for  T  which  must  be  solved  iteratively.  A  Newton  iteration 
is  used  to  approximately  solve  this  equation  for  each  cell  on  each  time  step.  The 
temperature  from  the  previous  time  step  is  used  as  an  initial  condition  for  the  Newton 
iteration. 

For  the  evaluation  of  the  Steger  and  Warming  fluxes  7,  d;,,  and  7  are  needed.  The 
speed  of  sound  used  for  the  calculation  of  7  is  the  frozen  chemistry  speed  of  sound 
defined  by 


where  c,  =  p^/p  is  the  mass  fraction  of  species  i.  Using  this  and  previously  given 
definitions,  the  resulting  expressions  for  a  thermally  perfect  mixture  are 


Here 


7 

7 


R  = 


Cfj  — 


NS  b 
ft 

5=1  ^ 

NS 

'Lc.c., 

5=1 


Note  that  the  expressions  for  7  and  7  are  very  similar  except  that  the  actual  specific 
heat.  Cy,  is  used  for  7  while  the  mean  specific  heat,  6^  is  used  for  7. 
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4.2.4  Thermally  Perfect  Species  -  Thermal  Nonequilibrium 

For  thermal  nonequilibrium  we  must  consider  the  detailed  contributions  to  the  in¬ 
ternal  energy.  It  is  generally  assumed  [9,  10,  11,  12]  that  the  internal  energy  is  the 
sum  of  independent  energies  from  translational  motion,  rotation  motion,  electronic 
excitation,  and  vibrational  excitation. 

e  =  €(  -t-  Cr  +  Ce  -f-  e„ 

In  general,  each  of  these  components  may  have  an  independent  temperature  associ¬ 
ated  with  it  whereas  in  equilibrium  all  of  the  temperatures  are  the  same.  For  most 
calculations  it  is  reasonable  to  cissume  that  the  rotational  mode  is  in  equilibrium  with 
the  translational  mode  [10,  11]  since  they  both  equilibriate  very  fast  in  comparison  to 
the  vibrational  mode.  This  allows  these  two  energies  to  be  considered,  for  all  practi¬ 
cal  purposes,  eis  one  energy,  etr  =  et  +  Cr,  which  is  described  by  the  temperature  T. 
Likewise,  the  electronic  and  vibrational  energy  levels  may  be  considered  to  be  in  equi¬ 
librium  and  are  described  by  the  combined  energy  e„e  =  e*  +  e„  and  the  temperature 
r„.  Furthermore,  it  is  assumed  that  the  vibrational/electronic  modes  of  all  species 
are  in  equilibrium  with  one  another  and  with  the  electron  translational  energy  mode 
at  the  temperature  Ty. 

The  goal  of  this  section  is  to  describe  how  the  temperatures  T  and  Ty  are  calculated 
from  the  species  densities  p,,  total  internal  energy  e,  and  vibrational/electronic  energy 
e„j.  To  do  this  the  functional  form  of  the  specific  heats  for  the  two  modes  in  terms 
of  their  respective  temperatures  is  needed.  Since  the  energy  modes  are  independent 
we  may  write  for  the  case  of  equilibrium 

CV.{T)  =  Cy„  .{T)  -f  Cy^,JT) 
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The  variation  of  c„,  is  easily  obtained  for  each  species  from  the  curve  fits  of  equation 

7. 

=  Ui  —  1  +  (I2T  + 

K 

Since  the  code  is  generally  used  to  analyze  flows  of  gas  at  high  temperatures,  the 
rotational  modes  of  diatomic  or  polyatomic  gases  are  assumed  to  be  fully  excited. 
The  species  specific  heat  for  the  translational  and  rotational  modes  is  therefore  inde¬ 
pendent  of  temperature  and  given  by 

^  =  1.5  +  d  (9) 

where  d  is  the  number  of  rotational  degrees  of  freedom.  For  monatomic  molecules 
d  =  0,  for  diatomic  molecules  d  =  2,  and  for  non-inline  polyatomic  molecules  d  = 
3.  The  remaining  contributions  to  specific  heat  are  from  vibrational  and  electronic 
excitation. 

— =  Ui  —  2.5  —  d  -f-  02^  "b  -b  cl$T^ 

H 

This  approach  to  calculating  ,  differs  considerable  from  the  approach  taken  by 
Candler  [12]  and  Park  [11].  They  both  used  approximate  formulas  for  ^  and 
based  on  fundamental  physics.  However,  their  approach  is  more  complex  and  should 
yield  no  better  accuracy  than  the  approach  taken  here.  Furthermore,  the  current 
approach  has  the  advantage  of  keeping  the  physics  used  to  calculate  the  equilibrium 
constants  for  the  chemical  reactions  consistent  with  the  physics  used  to  calculate 
the  temperatures.  Both  Candler  and  Park  use  simple  curve  fits  for  the  backward 
reaction  rates  which  is  less  accurate  and  less  general  than  calculating  the  backward 
reaction  rates  from  the  equilibrium  constants  and  the  equihbrium  constants  from  the 
thermodynamics  as  is  done  in  this  investigation. 

Once  the  variation  of  the  specific  heat  ^  with  is  known  we  may  obtain  the 
variation  of  with  T^e  for  a  species  by  integration. 

^  (To) 

=  b^,  +  {a„-2.5-d,)T„  +  ^Ti  +  ^1^,  +  ^1l  +  ^Tl  (10) 

(11) 

where  Tq  is  some  temperature,  within  the  temperature  range  for  the  curve,  at  which 
the  value  of  e^e,  is  known  and 

K.  =  e„.  (r„)  -  [(a„  -  2.5  -  d.)  T„  +  +  ^7?  +  ^r„']  (12) 

As  mentioned  previously  the  curve  fits  are  generally  broken  into  three  separate  poly¬ 
nomials  for  three  temperature  ranges.  The  constant,  Ag,,  is  first  calculated  using  the 
curve  for  the  temperature  range  containing  7Ve/,  the  reference  temperature,  where 


=  0.  The  calculation  is  performed  using  equation  12  with  To  =  and  =  0. 
The  constant  is  then  calculated  for  neighboring  curves,  using  equation  12  by  re¬ 
quiring  that  e„e,  be  continuous  at  the  boundary  between  two  curves. 

Equation  10  is  an  expression  for  the  species  vibrational  energy  e^e,  in  terms  of  the 
vibrational  temperature  After  making  the  appropriate  sums  the  expression  for 
the  mixture  vibrational  energy  e„e  in  terms  of  the  vibrational  temperature  T^g  is 


This  expression  is  a  nonlinear  equation  for  r„g  which  must  be  solved  iteratively.  A 
Newton  iteration  is  used  to  approximately  solve  this  equation  for  each  cell  on  each 
time  step.  The  vibrational  temperature  from  the  previous  time  step  is  used  as  an 
initial  condition  for  the  Newton  iteration. 

The  equation  for  the  species  traiislation/rotational  energy  etr,5  in  terms  of  trans¬ 
lational/rotational  temperature  T  is  obtained  by  integrating  equation  9  for  c^,,^  and 
requiring  that,  when  in  equilibrium,  -f  -I- 1  give  the  original  curve  fit,  equation 
7.  The  result  is 

——  =  dgi  —  b^a  +  (1-5  -h  da)  T  (14) 

When  summed  appropriately  this  expression  gives  the  following  equation 

NS  p 

«  -  Cve  =  — •T-p[a65  -  ^5  +  (1-5  +  d,)T]  (15) 

This  is  a  linear  algebraic  equation  which  may  be  solved  directly  for  T. 

The  speed  of  sound  used  to  evaluate  7  is  the  frozen- chemistry,  frozen  vibrational- 
electronic  energy  speed  of  sound  defined  by 


Using  this  and  previously  given  definitions,  the  resulting  expression  for  a  thermally 
perfect  mixture  in  thermal  nonequilibrium  are 


Cv 


tr 


Note  that  c„,^  is  the  mixture  translational-rotational  specific  heat.  This  value  is 
independent  of  temperature  but  depends  on  the  species  concentrations. 

With  the  addition  of  thermal  nonequilibrium  there  are  now  two  k’s;  one  for  each 
independent  internal  energy. 
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As  before,  there  is  a  x  defined  for  each  species. 


The  expressions  for  the  pressure  derivatives  for  the  mixture  of  thermally  perfect 
species  is 


R  , 

m; 

R 

Ktr 

z=  - 

Cv., 

Ky 

=  0 

The  pressure  derivative  with  respect  to  the  vibrational-electronic  energy,  e„e.  is  zero 
because  the  pressure  is  not  a  function  of  the  vibrational-electronic  temperature,  T^. 
If  the  translational  energy  of  the  free  electrons  is  in  equilibrium  with  the  vibrational- 
electronic  energy  of  the  other  species  rather  than  their  translational-rotational  inter¬ 
nal  energy,  the  pressure  will  depend  on  Ty  and  the  above  expressions  will  be  more 
complicated. 


4.3  Transport  Properties 

The  transport  properties  are  the  viscosity  fi,  the  conductivity  k,  and  the  binary 
diffusion  coefficient  T>.  These  quantities  are  calculated  within  the  code  using  various 
user  specified  methods.  These  methods  are  summarized  below. 

4.3.1  Viscosity 

The  viscosity  is  calculated  using  one  of  three  methods.  The  first  method  is  to  assume 
that  the  viscosity  depends  only  on  temperature  and  that  the  dependence  may  be 
approximated  by  the  formula 

^  A^sT  -t- 

The  famous  Sutherland’s  law  is  obtained  by  setting 

=  1.451  X  10-® 

=  1.5 
=  0 
=  0 
=  1 
=  110. 
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The  second  method  for  calculating  the  viscosity  is  the  equilibrium  air  curve  fits  of 
Tannehill  [13].  The  third  and  final  method  for  calculating  viscosity  is  to  use  Blottner’s 
formula  for  the  viscosity  of  individual  species 

fis  =  0.1  exp  [(.4^  In  r  +  B'^)  \nT  +  , 

and  then  use  Wilke’s  mixing  formula  to  obtain  the  mixture  viscosity.  The  Wilke’s 
serniempirical  mixing  rule  is 

s  0a 

where  is  the  mole  fraction  of  species  s  and 

=  i:  [i  +  M  [s  (i  +  (16) 

m  L  V  V  iVf,  /  J  L  V  MmJl 

This  latter  method  was  used  by  Candler  for  thermochemical  nonequilibrium  calcula¬ 
tions  [12]. 

4.3.2  Conductivity 

Conductivity  is  calculated  using  one  of  four  methods.  The  first  is  to  assume  a  Prandtl 
number  Pr  and  calculate  the  conductivitv  from 


This  method  is  always  used  for  turbulent  flows.  A  second  is  to  assume  that  the 
conductivity  is  a  function  of  temperature  only  and  use  the  same  sort  of  formula  as 
used  for  viscosity. 

k  =  +  AfcjT  -t-  Ak^ 

+  Akf, 

The  third  method  for  calculating  the  conductivity  is  the  equilibrium  air  curve  fits 
of  Tannehill  [13].  The  final  option  for  calculating  conductivity  is  to  use  the  Eucken 
formula  to  calculate  the  species  conductivity  from  the  species  viscosity, 

^ve,  =  Ms  I've. » 

and  then  determine  the  mixture  conductivity  using  Wilke’s  formula, 

,  Xg  k, 

s 

Y  L' 

f;  -  Y' 

'''  '  A. 

s  <Pa 


where  0,  is  given  in  equation  16. 
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4.3.3  Diffusion  coefficient 

The  species  diffusion  coefficients  are  calculated  using  an  expression  from  Lee  [10]. 

*  1-X, 

The  diffusion  coefficient  V  is  calculated  from  a  specified  Schmidt  number  Sc- 


The  default  Schmidt  number  is  0.5.  The  diffusion  coefficient  for  ions  is  generally 
assumed  to  be  double  the  neutral  species  diffusion  coefficient  because  of  the  existence 
of  an  electric  field.  This  is  accomplished  in  the  code  by  specifying  a  different  Schmidt 
number  for  each  species  {S^  =  0.5  for  neutral  species,  Sc,  =  0.25  for  ions).  The 
species  diffusion  coefficient  is  then  given  by 

Tf  _  —  Cj) 

'  1  -  pSc. 

The  electron  diffusion  coefficient  is  often  treated  specially  [10],  but  here  it  is  calculated 
like  any  other  species. 

4.4  Boundary  Conditions 

The  boundary  conditions  are  an  extremely  important  aspect  of  the  problem  formu¬ 
lation.  At  each  boundary  of  the  flow  domain  boundary  conditions  must  be  applied 
based  on  the  physical  nature  of  the  boundary.  The  code  heis  options  for  several  types 
of  boundary  conditions.  These  boundary  conditions  are  discussed  below. 

4.4.1  Free-Slip  Walls 

When  an  inviscid  analysis  is  being  performed,  or  when  a  known  stream  surface  is 
chosen  as  a  boundary  of  the  computed  flow  domain,  a  free-slip  boundary  condition  is 
specified.  The  boundary  condition  is  simply  that  there  is  no  flow  through  the  wall. 
This  means  that  the  component  of  velocity  normal  to  the  surface  is  zero  and  that 
the  component  of  velocity  tangent  to  the  surface  is  allowed  to  change  as  required 
by  the  governing  equation.  The  presence  of  a  free  slip  wall  also  affects  the  pressure 
and  temperature  fields.  If  the  wall  is  curved,  a  non-zero  pressure  gradient  is  needed 
normal  to  the  wall  to  turn  the  flow  in  the  direction  that  the  wall  is  curving.  This 
pressure  gradient  may  be  calculated  from  the  normal  momentum  equation  at  the  wall 
and  used  to  set  the  wall  pressure.  However,  in  most  analyses,  the  viscous  nature  of 
the  flow  near  actual  walls  is  generally  so  significant  that  a  no-slip  boundary  condition 
is  necessary.  The  free-slip  boundary  condition  then  should  be  used  only  along  stream 
surfaces,  such  as  planes  of  symmetry,  which  are  not  curved.  The  gradients  of  pressure 
and  temperature  normal  to  free-slip  walls  are,  therefore,  assumed  to  be  zero. 
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Heat  conducting  solid  - 

Inner  wall  at  temperatura  T,. 


Figure  12;  Types  of  heat  transfer  to  the  wall 
4.4.2  No-Slip  Walls 

For  flows  of  gases  at  all  but  the  lowest  densities  the  velocity  at  walls  is  essentially 
zero.  Furthermore,  for  high  Reynold’s  number  flows,  the  normal  pressure  gradient  is 
approximately  zero  deep  within  the  boundary  layer.  This  leaves  the  specification  of 
boundary  conditions  for  translational-rotationed  temperature,  vibrational-electronic 
temperature,  species  concentrations,  turbulent  kinetic  energy,  and  turbulent  energy 
dissipation  rate  at  the  wall. 

The  wall  temperatures  may  either  be  specified  (isothermal)  or  determined  from 
an  analysis  of  the  various  heat  transfers  to  and  from  the  wall.  For  all  but  adia¬ 
batic  walls  the  vibrational-electronic  energies  are  assumed  to  in  equilibrium  with  the 
translational-rotational  energies  at  the  wall  so  that  Tve  =  T.  For  adiabatic  walls  the 
gradient  of  the  vibrational  energy  is  zero,  =  0,  where  y'  is  the  distance  normal 
to  the  wall. 

The  possible  forms  of  heat  transfer  to  the  surface  of  the  wall  are  shown  in  Figure 
12.  In  general  they  include  heat  conduction  to  the  surface  from  the  gas,  beat  con¬ 
duction  to  the  surface  from  within  the  solid  wall,  and  radiation  of  heat  away  from 
the  surface.  The  first  law  of  thermodynamics  requires  that  the  sum  of  the  heat  fluxes 
into  the  surface  be  zero. 

9ff  +  93+9r  =  0  (17) 

By  writting  qg,  and  qr  in  terms  of  the  wall  temperature  and  temperature  gradient, 
equation  17  may  be  used  to  determine  the  wall  temperature. 

The  simple  adiabatic  wall  boundary  condition  is  obtained  when  the  wall  is  a 
nonradiating  insulator.  In  this  case  ?,  =  9r  =  9?  =  0-  The  heat  conduction  within 
the  gas  is  given  by  Fourier’s  law  of  heat  conduction,  qg  =  —  A:|^,  for  which  the  heat 


30 


conduction  is  directly  proportional  to  the  normal  gradient  of  temperature.  If  =  0 
the  normal  gradient  of  temperature  is  zero. 

When  the  wall  is  not  a  perfect  thermal  insulator,  or  the  wall  emits  or  absorbs 
thermal  radiation,  the  boundary  condition  for  the  wall  temperature  becomes  more 
complex.  Wall  heat  conduction  can  get  quite  complex,  depending  on  the  wall  geome¬ 
try  and  the  properties  of  the  wall  material.  Likewise,  the  radiative  heat  transfer  to  the 
wall  can  also  be  very  complex  depending  on  the  geometry  of  the  surface,  temperatures 
of  the  gas,  and  composition  of  the  gas.  A  complete  treatment  of  these  phenomena 
is  beyond  the  scope  of  this  research  effort.  However,  simple  approximations  for  the 
wall  heat  conduction  and  radiative  heat  transfer  are  included  for  the  estimation  of 
hypersonic  vehicle  wall  temperatures.  The  wall  heat  conduction  is  assumed  to  be 
through  a  thin  skin  of  thickness  uniform  conductivity  and  no  heat  capacity. 
Furthermore,  the  temperature  at  the  inner  surface  of  the  skin,  T,u,,  is  assumed  to  be 
constant.  The  expression  for  the  wall  heat  flux  per  unit  area  is  then 
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T  -  T 

■*-W  -*11 

^wall 


(18) 


The  wall  radiative  heat  transfer  is  assumed  to  be  between  the  wall  and  some  distant 
body  of  temperature  .Also,  it  is  assumed  that  there  is  no  interaction  of  the 

radiation  with  the  gas.  The  radiative  heat  flux  per  unit  area  is  then 

qr  =  —7  -  au,T^^]  (19) 


Here  a  is  the  Stefan  Boltzmann  constant,  is  the  emissivity  of  the  wall,  and  Oyj 
is  the  absorptivity  of  the  wall.  This  expression  excludes  gas  to  wall  radiative  heat 
transfer  which  can  contribute  significantly  to  the  surface  heat  flux  balance. 

Substituting  the  expressions  for  q'p,  <7j,  and  qr  into  equation  17  we  get  the  following 
equation. 
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(20) 


This  is  a  diflerential  expression  for  the  wall  temperature  7L  which  is  coupled  to  the 
temperature  field  through  the  normal  temperature  derivative  .  The  numerical 
treatment  of  this  equation  is  described  elsewhere  in  this  report. 

For  flows  with  species  transport,  the  walls  are  assumed  to  be  non-catalytic  so 
that  ^  =  0.  The  validity  of  this  a,ssumption  will  depend  on  the  nature  of  the  wall 
material  and  the  degree  to  which  the  flow  is  out  of  equilibrium. 


4.4.3  Rarefied  Gas  Slip  Walls 

Tile  rarefied  gas,  slip  wall  boundary  condition  is  included  to  enable  analysis  of  hy¬ 
personic  vehicles  at  higher  altitudes  than  would  otherwise  be  possible.  At  the  high 
altitudes  and  hypersonic  velocities  at  which  many  aerospace  vehicles  are  expected  to 
operate,  the  Knudsen  numbers  get  large  and  the  continuum  assumption  begins  to 
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break  down.  At  the  edge  of  the  continuum  limit,  the  continuum  assumption  fails  first 
in  a  region  next  to  the  wall  having  a  thickness  on  the  order  of  the  local  molecular 
mean-free  path.  In  this  layer,  called  the  Knudsen  layer,  the  Navier-Stokes  equations 
no  longer  apply.  A  rigorous  treatment  of  the  flow  within  the  Knudsen  layer  would 
require  the  solution  of  the  Boltzmann  equation.  However,  such  a  procedure  is  beyond 
the  scope  of  this  work,  and  would  likely  be  very  expensive. 

If  the  Knudsen  layer  is  thin  the  flow  field  may  be  obtained  by  solving  the  equations 
of  continuum  flow,  the  Navier-Stokes  equations  in  this  case,  with  modified  boundary 
conditions  [14,  15,  16,  17].  In  particular,  a  jump  or  slip  in  the  wall  values  of  species 
concentrations,  pressure,  velocity,  and  temperature  must  be  allowed.  The  code  uses 
simplified  versions  of  the  surface  slip  equations  derived  by  Gupta,  Scott,  and  Moss 
[17]. 

The  jump  relations  are  used  for  two  purposes;  to  obtain  boundary  conditions 
at  the  edge  of  the  continuum  flow  domain,  and  to  obtain  wall  values  of  pressure, 
temperature,  etc.  for  force  integrations  and  thermal  load  estimations.  For  example, 
the  pressure  jump  is  not  actually  needed  for  the  boundary  condition  but  is  needed  to 
determine  the  wall  pressure  for  force  integrations.  Conversely,  the  temperature  jump 
is  needed  for  the  boundary  condition  when  either  the  wall  temperature  is  specified  or 
a  radiating  and/or  conducting  wall  is  considered.  The  slip  velocity  relation  is  always 
used.  The  boundary  conditions  are  actually  the  conditions  at  the  edge  of  the  Knudsen 
layer  and  will  be  indicated  by  the  subscript  s. 

To  remain  consistent  with  the  available  no-slip  boundary  conditions,  the  rarefied- 
slip  boundary  is  assumed  to  be  non-catalytic.  Therefore 


Likewise,  the  pressure  gradient  at  the  edge  of  the  Knudsen  layer  is  eissumed  to  be 
zero.  The  reasoning  is  the  same  cis  for  the  no-slip  wall. 

For  the  slip  relations  vve  define  a  local  coordinate  system  with  y'  normal  to  the 
surface  and  x'  and  c'  tangent  to  the  surface.  The  velocity  slip  relationship  is  obtained 
from  equations  40  and  41  in  reference  [17]  by  eissuming  that  the  derivatives  along  the 
surface  are  small  compared  to  normal  derivatives.  The  result  is 
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The  normal  velocity,  u'  is  of  course  zero  because  the  wall  is  impermeable. 
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The  temperature  boundary  condition  is  different  depending  on  whether  the  wall 
is  adiabatic  or  otherwise.  If  it  is  adiabatic  the  total  energy  flux  to  the  wall  must  be 
zero. 

'dT  (  ,du'  ,dw'' 
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=  0 


(24) 


Since  the  tangential  velocities  are  no  longer  zero  at  the  boundaries,  the  work  term  is 
not  zero  and  the  temperature  derivative  is  not  zero. 

If  the  wall  is  not  adiabatic,  the  following  temperature  jump  equation,  obtained 
from  equation  43b  of  reference  [17]  is  used. 
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Here  Py  is  the  normal  momentum  flux,  Py  =  p*  +  Tyy,  which  is  approximately  equal 
to  the  pressure,  p'*,  if  the  Knudsen  Layer  is  thin.  When  the  wall  temperature  Tu, 
is  specified,  the  above  equation  is  used  directly  cis  an  expression  for  the  boundary 
temperature  Tj.  When  the  wall  is  radiating  and/or  conducting,  the  above  equation  is 
used  in  conjunction  with  the  energy  balance  equation,  equation  17.  This  equation  is 
used  with  qg  given  by  equation  24,  q,  given  by  equation  18,  and  q^  given  by  equation 
19.  The  resulting  expression  is 
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(26) 


Equations  25  and  26  are  two  nonlinear,  differential  expressions  for  T,  and  T^.  To¬ 
gether,  they  constitute  the  temperature  boundary  condition  for  rarefied-gas,  radiating 
and/or  conducting  walls. 


4.4.4  Supersonic  Inflow 

The  supersonic  inflow  boundary  condition  requires  specification  of  all  variables  by  the 
user.  This  boundary  condition  is  only  appropriate  where  the  Mach  number,  based  on 
the  velocity  at  the  boundary,  is  greater  than  one. 

4.4.5  Supersonic  Outflow 

Along  a  supersonic  outflow  boundary  nothing  may  be  specified  by  the  user.  The 
flow  variables  along  such  a  boundary  are  completely  determined  by  the  upstream 
flow.  Strictly  speaking,  this  boundary  condition  is  only  appropriate  when  the  Mach 
number,  based  on  the  velocity  normal  to  the  boundary,  is  greater  than  one.  However, 
its  use  is  acceptable  for  Mach  numbers  less  than  one  when  the  low  velocity  is  due  to 
a  thin  boundary  layer  along  a  nearby  wall. 
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4.4.6  Mixed  Subsonic/Supersonic  Outflow 

The  treatment  of  subsonic  outflow  boundary  conditions  is  guided  by  the  theory  of 
characteristics.  For  a  subsonic  flow  at  the  exit  the  u'  —  c  characteristic  propagates 
information  upstream  from  the  boundary  cell  to  the  interior  cell.  In  this  case,  one 
variable  must  be  specified  at  the  boundary  cell.  Currently,  a  constant  static  pressure 
is  specified  at  the  outflow  boundary. 


SpB  =  0 

The  remaining  variables  in  the  boundary  cell  are  calculated  using  the  four  downstream 
running  characteristic  equations  (thermal  equilibrium  and  no  chemistry).  These  equa¬ 
tions,  written  in  delta  form,  are 


+  -^^Pb 
SpB  +  pcSug 
Sv'g 
Sw'b 
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h  ■ 
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-Pl  +  -^(PB  -  P/)|  =  Rl 


IpB  -  PI  +  pc  {u'b  -  U/)]  =  R2 
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The  above  equations  are  five  linear  algebraic  equations  in  the  five  unknowns  6pB,  Su'g, 
8vg,  Sw'g,  and  SpB-  This  system  is  solved  directly  and  the  boundary  cell  solution  is 
updated. 

When  there  is  a  secondary  energy  equation,  for  vibrational-electronic  energy,  the 
vibrational  energy  is  simply  extrapolated.  The  same  is  true  of  the  species  mass  frac¬ 
tions  and  turbulence  quantities  {k  and  e).  This  is  consistent  with  the  theory  of  char¬ 
acteristics  since  the  vibrational  energy  and  species  transport  equations  corresponds 
to  a  u'  characteristic. 


4.4.7  Subsonic  Inflow 

For  subsonic  inflow  the  stagnation  pressure,  stagnation  temperature,  and  flow  di¬ 
rection  are  specified.  These  quantities  are  related  to  the  static  pressure  and  static 
temperature  by  the  following  equations: 
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li  =  (t^)„  = 

-—  =  (77-)  =  specified  (27) 

The  first  two  equations  above  are  simply  the  isentropic  relations  written  in  terms  of 
the  total  velocity,  V^o/>  and  the  speed  of  sound  at  a  sonic  throat,  a..  The  speed  of 
sound  at  a  sonic  throat  is  calculated  from  the  specified  stagnation  temperature. 

,  ,2  2/?  pt 

(a,)  =  - 

^  ’  1  +  1  Pt 

Equations  27  are  a  system  of  five  equations  in  five  unknowns:  p,  p,  u,  v.  and  w,  but  one 
of  the  last  three  equations  is  redundant.  To  complete  the  system  another  equation 
is  needed.  This  is  to  be  e.xpected  since  there  is  an  upwind  running  characteristic 
carrying  information  out  of  the  flowfield  interior  to  the  inflow  boundary. 

The  last  equation  to  close  the  system  is  the  characteristic  relation  carrying  infor¬ 
mation  upstream  to  the  inflow  boundary. 
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This  equation  is  forward  differenced  in  time. 

SpB  -  pcSu'g  =  \pi  -PB  -  pc  {u'l  -  ug)]”  (28) 

The  .subscripts  1  and  B  indicate  the  first  interior  cell  and  the  inflow  boundary  cell, 
respectively.  The  prefix  S  indicates  the  forward  in  time  difference  of  the  variable 
following  it. 

The  number  of  unknowns  is  reduced  to  three  if  the  isentropic  relations,  the  first 
two  of  Equations  27  are  written  in  terms  of  u'.  This  is  done  using  the  relation 

Vl,  =  (29) 

where 
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The  modified  isentropic  relations  are 
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The  modified  isentropic  relations,  Equations  30,  and  the  discrete  form  of  the  upstream 
running  characteristic  relation.  Equation  28  are  three  algebraic  relations  in  three 
unknowns. 

The  isentropic  relation  for  pressure,  the  first  of  Equation  30,  may  be  placed  in 
delta  law  form  by  considering  incremental  changes  in  the  variables  p  and  u'. 


^Pb  =  Pt 


267 
7  +  1 


8u'b 


(31) 


This  equation  and  Equation  28  are  solved  for  The  pressure  and  density  are  then 
obtained  from  the  isentropic  relations.  Equations  30.  The  velocities  are  also  calculated 
from  u'b  using  Equation  29  and  the  last  three  of  Equations  27.  The  specification  of 
the  flow  within  the  subsonic  inflow  boundary  cell  is  then  complete  for  flow  without 
chemistry. 

For  flow  with  chemistry,  the  mass  fractions  of  the  species  are  also  specified  at  the 
inflow  boundary.  The  species  densities  are  then  calculated  from  the  specifies  species 
mass  fractions  and  the  calculated  total  density.  At  subsonic  inflow  boundaries  the 
vibrational/electronic  modes  of  internal  energy  are  assumed  to  be  in  equilibrium  with 
the  translational/ rotational  modes.  Therefore  the  vibrational  temperature  is  set  equal 
to  the  translational  temperature  and  the  energies  are  calculated  accordingly. 

When  the  two-equation  turbulence  model  is  used,  the  turbulent  kinetic  energy  k 
and  dissipation  rate  e  are  specified  by  the  user. 
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5  Navier- Stokes  Solution  Procedure 


The  Navier-Stokes  equations  are  solved  using  an  LU-SGS  implicit  finite- volume  method 
which  is  based  on  work  by  Yoon[18,  19].  This  type  of  algorithm  has  proven  to  be 
a  robust  and  efficient  relaxation  procedure  for  steady  state  flow  calculations.  The 
algorithm  used  in  this  investigation  is  a  combination  of  the  algorithm  presented  by 
Peery  and  Imlay[20]  and  Yoon’s  algorithm.  A  detailed  description  of  the  algorithm 
is  given  in  the  following  subsections. 


5.1  Internal  Grid  Cells 

An  individual  finite  volume  cell,  with  indices  i,  j,  and  k,  is  shown  in  Figure  13. 
Applying  the  integral  equations  in  this  volume  gives 

=  ~iDiPS  +  DjP-S  +  DkP-S)  +  Ni,,,k 


where  (/ij,*;  is  the  mean  value  of  U  in  cell  and  DiP  S,  for  example,  represents 
the  difference  of  the  fluxes  through  opposing  faces  of  the  cell. 

The  time  derivative  is  approximated  using  backward  in  time  differencing: 


Vokj.k 


SU,,j,k  =  -{Di  P-S-k  D,  P  S  +  Dk  P-S)  -f 


where 


m.,.,  =  uzii  -  uzjj. 


For  the  approximation  of  the  flux  through  a  surface  the  inviscid  and  diffusion  terms 
of  the  flux  vector  are  considered  separately. 

P  S  =  P-S'’"'  -k 


Figure  13:  Finite  Volume  Cell 
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These  terms  are  then  evaluated  in  a  manner  consistent  with  the  predominant  nature 
of  the  equations  in  the  limit  as  Re  —*  oo  (hyperbolic)  and  Re  0  (parabolic);  i.e., 
upwind  differencing  for  the  inviscid  terms  and  central  differencing  for  the  diffusion 
(viscous  stress  and  heat  flux)  terms. 

The  evaluation  of  the  inviscid  terms  is  based  on  the  flux  splitting  in  combination 
with  upwind-biased  MUSCL-like  differencing  [21].  The  code  currently  supports  two 
flux  functions;  the  flux-vector-splitting  of  Steger  and  Warming[22|  and  the  TVD 
fluxes  of  Yee[23].  The  diffusion  terms  are  evaluated  using  standard  central  differences 
[24].  These  flux  functions  are  described  in  detail  in  the  following  subsections. 

The  following  subsections  use  the  notation  F{U)  =  P-5'"’'  =  a  function  of  U. 
This  function  is  defined  in  equations  at  the  beginning  of  section  4.  The  actual  flux 
through  a  surface  is  given  a  tilde,  F,  to  distinguish  it  from  the  function. 


5.1.1  Steger  and  Warming  Flux  Function 

Flux-vector-splitting  was  developed  by  Steger  and  Warming  [22]  and  Van  Leer  [25]  as 
a  way  to  upwind  difference  the  Euler  equations  in  regions  of  subsonic  flow.  If  the  flow 
normal  to  the  surface  of  a  cell  is  supersonic,  the  theory  of  characteristics  tells  us  that 
the  flow  field  at  the  sur''ace  depends  only  on  the  solution  upstream  of  the  surface.  In 
this  case,  the  flux  at  the  surface  may  be  simply  evaluated  using  the  solution  in  the 
cell  upstream  of  the  surface. 


/  F(f/,.y.,)  if  t;'>c 
\  F(f/.+i.>.,)  \fv'<c 


(33) 


If  the  flow  is  subsonic  however,  the  flux  through  the  i  -t- 1/2  surface  depends  on  the 
solution  both  upstream  and  downstream  of  the  surface.  Then  the  simple  upwinding 
applicable  to  supersonic  flows,  equation  33,  is  no  longer  appropriate.  The  term  up¬ 
wind  differencing  must  now  be  defined  to  mean  differencing  in  a  manner  appropriate 
with  the  mathematical  characteristics  of  the  flow.  As  shown  in  Figure  14,  there  are 
characteristics  running  both  to  the  right  and  left  in  subsonic  flow  and  upwind  dif¬ 
ferencing  must  therefore  use  backward  differencing  on  some  of  the  flux  and  forward 
differencing  on  the  rest  of  the  flux. 

In  flux-vector-splitting  methods,  the  flux  function  is  written  as 


F(F)  =  F+(t/)-f  F-(F)  (34) 

so  that  the  flux-split  Jacobians.  have  eigenvalues  satisfying 

-^m  (■^~)  <  0  fo*'  (35) 

Flux  splitting  has  therefore  divided  a  flux  which  cannot  in  general  be  upwinded  into 
two  fluxes  \vhich  separately  can  be  upwinded.  With  first  order  upwinding  the  flux 
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Figure  14;  Mathematical  characteristics  of  one- dimensional  subsonic  flow 


through  the  i  1/2  surface  is 


Higher  order  accurate  methods  are  obtained  by  combined  extrapolation  and  interpo¬ 
lation  of  the  solution  to  the  surface  from  nearby  cells. 


u- 


+  J  1(1  -  S)  (t/..,.*  -  +  (1  +  *)  -  t/, .,,»))  (36) 

UM.,.k  +  J  1(1  -  -  Vm,m)  +  (!  +  «)  (tl,.,,k  -  (37) 


Here  o  varies  the  differencing  between  first  order,  </>  =  0,  and  second  order.  0=1. 
The  second  parameter,  k,  varies  the  differencing  between  fully  upwind,  «  =  —  1,  and 
central,  k  =  1.  The  second  order  flux  is  then 


Kxi2,,,k  =  F+  [u-)  +  F-  (l/+) 

This  form  of  differencing  is  called  MUSCL  differencing. 

Equations  .34  and  35  do  not  uniquely  define  the  split  fluxes  F^  (U).  In  fact,  there 
are  many  flux-splittings  that  have  been  proposed  [22,  25,  26).  The  tivo  most  common 
flux-splittings  are  those  of  Steger  and  Warming  [22]  and  Van  Leer  [25].  Both  splittings 
were  originally  developed  for  ideal  gases  and  were  later  extended  for  use  with  real 
gases  [7,  27].  The  code  uses  Steger  and  Warming  flux-vector-splitting  with  a  simple 
extension  to  multiple  species  transport  and  thermal  nonequilibrium. 

Steger  and  Warming  [22]  developed  their  splitting  based  on  the  homogeneity  prop¬ 
erty,  F  =  AU,  of  the  flux  vector  with  a  thermally  perfect  gas.  They  wrote  the  flux 
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Jacobian  as 


where 


+  A- 


=  7^-'A±7^ 


and  is  the  diagonal  matrix  with  elements 

Here  the  matrices  A,  TZ,  and  are  the  diagonal  matrix  of  eigenvalues,  matrix  of 
eigenvectors,  and  its  inverse,  for  the  flux  Jacobian  A.  Therefore,  A'*'  and  A~  are  the 
flux  Jacobian  matrix  with  the  negative  and  positive  eigenvalues  set  to  zero.  The  split 
fluxes  are  F*  =  A^U .  This  process  may  be  performed  numerically  for  each  face  on 
each  time  step,  but  it  requires  less  computer  time  if  the  matrix  multiplies  are  not 
required. 

To  eliminate  the  matrix  multiplies,  the  contribution  to  the  flux  from  each  eigen¬ 
value  must  be  considered  separately.  There  are  three  unique  eigenvalues:  Aj  =  v\ 
A2  =  e'  +  c,  and  A3  =  u'  —  c.  The  \\  eigenvalue  is  repeated  so  that,  for  flows  without 
species  transport  or  thermal  nonequilibrium,  A  =  diajf  {Ai,  A2,  Ai,  Aj,  A3}.  The  sep¬ 
arate  contributions  from  each  eigenvalue  are  obtained  by  defining  diagonal  matrices 
with  all  but  the  desired  eigenvalue  set  to  zero. 

Ai  =  diag  {Xi,0,\i,Xi,0} 

A2  =  {0,  A2,0,0,  0} 

A3  =  dfag  {0,0,0, 0,  A3} 

The  fluxes  are  then  X^TZU .  Performing  the  above  operations  for  thermally 

perfect  gas  without  species  transport  or  thermal  nonequilibrium  gives. 
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Vinokur  [7]  used  an  alternative  approach  to  derive  the  Steger  and  Warming  fluxes 
and  found  that  the  above  formulas  also  apply  to  real  gases,  provided  that  7  is  defined 


7  = 

'  p 


The  split  fluxes,  are  obtained  from  the  by  splitting  according  to  the  sign 
of  the  eigenvalue.  For  — c  <  i/  <  0  we  therefore  have 


F+  =  F^  and  F"  =  F^;  +  F^. 


These  fluxes  are  relatively  inexpensive  to  calculate  compared  with  the  formulation 
requiring  numerical  matrix  multiplies. 

The  Steger  and  Warnriing  fluxes  defined  by  equations  38  through  40  may  be  simply 
extended  to  handle  species  transport  and  thermal  nonequilibrium.  First,  evaluate  the 
flu.xes  with  species  transport  equations  combined  into  a  single  continuity  equation 
having  flux,  Fmass-  Then 


^  s  ^  ^  mass 

p 


The  same  approach  is  used  for  the  vibrational/electronic  energy  flux, 


p+  -  1?+ 

■*  Cv  ^  mass 

p-  _  jp^v)^  p- 
^ev  ^  mass 

and  for  the  fluxes  for  the  <  turbulence  model.  Using  this  approach,  the  additional 
cost  of  computing  the  fluxes  for  additional  species  transport  or  energy  equations  is 
minimal. 


5.1.2  Harten  and  Yee  Flux  Function 

In  the  early  1980's  Yee,  Warming,  and  Harten  [23],  and  others  [28,  29,  30]  pre.-ented 
upwind  biased  schemes  for  nonlinear  scalar  equations  for  which  the  total  variation  of 
the  solution  always  diminishes  as  the  solution  proceeds. 

rV'(F"+*)  <  rv(F") 

where 

t=oo 

n’((y)=  Y,  ie.+i-c.l 

1  =  — 00 

Schemes  which  are  total  variation  diminishing  (TVD)  are  guaranteed  not  to  generate 
spurious  oscillations  and  are  therefore  very  robust  for  applications  involving  shock 
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waves.  When  the  same  scheme  is  applied  to  nonlinear  systems  it  is  no  longer  TVD 
but  it  still  exhibits  excellent  accuracy  and  robustness  for  problems  involving  strong 
shock  waves.  A  finite-volume  version  of  Harten  and  Yee’s  scheme  is  included  in  the 
code.  The  scheme  is  obtained  by  using  the  Harten- Yee  flux  function  described  in  this 
section. 


The  Harten- Yee  flux  function  is  written  as  a  central  difference  flux  plus  a  dissipa¬ 
tion  term. 

Fi+\l2,],k  =  ^  +  Tij.fc  +  {T^A^A)i+ii2,j,k  (41) 


The  dissipation  operators  are  defined  as  follows. 


=  9x  +  gx+i  -  ^  (Ah 


where 

i^i+i,j,k  ~  Ui,j,k)  (43) 

and,  in  the  standard,  form  =  \zim\,  where  z  is  any  matrix.  .A  modified  form 

of  ’J(r)  will  be  discussed  latter  in  this  section. 

A  key  to  the  definition  of  the  Harten- Yee  flux  is  the  eigensystem  of  the  flux 
Jacobian  matrix,  A  =  |£.  The  components  of  the  eigensystem  are  the  diagonal  matrix 
of  eigenvalues  A,  the  matrix  of  right  eigenvectors  IZ,  and  its  inverse  7i~^ .  Expressions 
for  these  are  given  in  appendix  A.  In  the  above  equations  is  the  dissipation  term 
which  will  reduce  the  scheme  to  Roe’s  first-order  upwind  scheme  [31]  in  regions  of 
steep  gradients.  The  definition  of  the  flux  is  completed  by  the  relations 

gi  =  5  max  [o,  min  (<t,+j/2|q, +1/21,5  <7i-i/2«i-  1/2)] 

5  =  sign{ai+i/2) 

(^,+1/2  =  (a, +1/2) 


and 


r<-t-i/2 


if  «,+i/2  ^  0 
if  oii+1/2  =  0 


To  avoid  excessive  roundoff  eiror,  r,+i/2  in  the  above  equation  is  actually  set  to  zero 
any  time  a,+j/2  is  near  zero  (  —  10"^^  <  a, +1/2  <  10“^^). 

This  scheme  is  theoretically  second-order  accurate  in  space.  Simply  stated,  the 
scheme  is  a  central  difference  scheme  where,  for  regions  of  smooth  solutions,  the 
dissipation  term  is  turned  off.  Conversely,  for  regions  with  steep  gradients  such  as 
shock  wave.-:,  it  reduces  to  a  first  order  scheme  to  avoid  aphysical  oscillations  in  the 
solution.  The  first  order  scheme  to  which  it  reduces  is  the  flux-difference  splitting  of 
Roe. 

Since  the  Harten- Yee  flux  function  is  based  on  Roe’s  flux-difference  splitting,  it 
shares  certain  abnormalities  with  Roe’s  scheme.  In  particular.  Roe’s  scheme  does 


42 


not  necessarily  satisfy  the  entropy  inequality  and  may  therefore  yield  aphysical  solu¬ 
tions  such  as  expansion  shock  waves  [28,  29,  26].  A  related  phenomena,  with  serious 
ramifications,  is  the  stagnation  line  Carbuncle  described  in  [33].  The  Carbuncle  is 
an  aphysical  inward  or  outward  bowing  of  the  bow  shock  wave  near  the  stagnation 
line  on  a  supersonic  blunt  body.  It  occasionally  occurs  when  an  unmodified  Roe  or 
Harten-Yee  scheme  is  used  for  hyp..'rsonic  blunt  body  calculations.  It  is  eliminated 
by  modifying  the  function  '!'(-)  so  that  it  never  becomes  zero.  We  use 

Far  away  from  zero  each  element  of  approaches  the  absolute  v'^alue  of  the  corre¬ 
sponding  element  of  as  it  should.  When  Zmm  approaches  zero,  however,  ^mm 
approaches  the  positive  number  £„• 

The  values  of  are  calculated  based  on  user  defined  constants  and  the  type 
of  characteristics  equation  which  corresponds  to  If  the  characteristic  transports 
entropy 
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If  the  characteristic  transports  acoustic  waves 
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Finally,  if  the  characteristic  transports  momentum  components  tangent  to  the  surface 
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Different  constants  were  provided  for  each  type  of  characteristic  because,  for  a  given 
problem,  constants  which  are  appropriate  for  one  type  of  characteristic  equation  may 
be  exccssi%e  for  another  type.  For  example,  a  blunt  body  problem  may  require  a 
fairly  large  value  of  to  avoid  a  Carbuncle  but  a  relatively  small  value  for  to 
avoid  excessive  smearing  of  the  boundary  layer. 

The  epu  time  for  the  matrix  multiplies  in  equations  41  and  43  would  normally 
increase  quadratically  with  the  number  of  transport  equations.  For  example,  the 
matrix  multiplies  would  require  more  than  four  times  as  much  epu  time  for  the  Wray 
model  (7  species.  11  transport  equations)  than  for  no  chemistry  (1  species,  5  transport 
equations).  In  this  code,  however,  the  matrix  multiplies  are  performed  analytically 
whenever  possible  and  the  epu  time  increases  nearly  linearly  with  the  number  of 
added  species. 

To  evaluate  the  variables  at  the  cell  surfaces  arithmetic  averaging  of  the  cell  cen¬ 
tered  values  is  used.  This  kind  of  averaging  has  the  advantage  of  computational  sim¬ 
plicity  and  can  be  easily  extended  for  problems  in  thermochemical  nonequilibrium. 
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Other  kinds  of  averaging  can  also  be  used,  but  they  are  usually  more  complicated. 
Arithmetic  averaging  takes  the  form 

1  , 

A+i/2  -  2 

The  same  procedure  is  used  to  calculated  u,  v,  w,  and  a  at  the  surface. 

5.1.3  Diffusive  Fluxes 

The  diffusive  fluxes  are  the  contributions  of  the  viscous  stresses,  heat  conduction,  and 
species  diffusion  to  the  surface  flux,  P-5  in  equation  32.  These  terms  are  obtained 

P-S'  =  Ffm  |5| 

where 

/  \ 


Ff  = 
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-f-r.3 
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—  u  — 
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3  '^dxk 


^  ’  dx, 

l-X, 


The  derivatives  are  approximated  using  standard  central  differences  [24]  centered  at 
the  cell  face. 
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The  calculation  of  the  derivatives  is  the  same  regardless  of  the  dependent  variable, 
so  we  will  present  formulas  for  temperature  T  only.  First  the  derivatives  are  estimated 
with  respect  to  the  ^,7/,  C  coordinates  where  r/,  and  C  are  the  coordinates  running 
in  the  increasing  i,  and  k  direction  respectively.  The  estimation  of  the  temperature 
derivatives  are  given  below  for  the  i  +  1/2,  j,  k  surface. 
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.j,k 
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dT 
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~  +  Tij^k+l  ~  Ti,j,k-l] 

The  treatment  of  the  j  +  112  and  fc  +  1/2  surfaces  is  very  similar.  Then  a  transfor¬ 
mation  is  performed  to  obtain  the  derivatives  with  respect  to  the  x,  y,  z  coordinates. 
The  transformation  of  the  derivatives  is  performed  as  follows. 
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When  the  3x3  matrix  above  is  inverted. 
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the  elements  of  the  resulting  matrix  may  be  approximately  evaluated  in  terms  of 
surface  area  projections  and  volumes. 


5.1.4  Implicit  Treatment  —  Point  Implicit 

The  flow  equations  with  finite-rate  chemistry  are  stiff  (difficult  to  solve)  because  of 
the  wide  range  of  time  scales  involved.  Certain  reactions  may  have  time  constants 
orders  of  magnitude  smaller  than  any  flow  characteristic  time  (r  =  L/V).  As  a  result, 
explicit  methods  are  unstable  at  any  reasonable  time  step  and  implicit  methods  are 
required.  Unfortunately,  our  experience  is  that  implicit  methods  are  also  adversely 
affected  by  the  chemical  stiffness  and  that  unacceptably  small  time  steps  must  be 
taken.  Add  this  to  the  fact  that  the  chemical  species  transport  equations  dramatically 
increase  the  cost  of  an  implicit  time-step  and  you  have  a  very  inefficient  scheme. 

One  standard  way  to  overcome  the  stiffness  of  the  chemical  source  terms  is  to 
treat  the  chemical  source  terms  implicitly  and  the  cell  fluxes  explicitly.  The  discrete 


45 


flux  balance  is  then 


=  -  (a  ^ -5  +  As  +  A  P  S)"  +  JV”«  (44) 

where  n  is  the  current  time  step  at  which  the  solution  is  known  and  n  +  1  is  the  new 
time  step  at  which  the  solution  is  being  calculated.  It  is  standard  to  linearize  the 
implicit  source  term 


so  that  the  discrete  flux  balance  equation  becomes 


6Uij^  =  -{DiPS  +  DiPS  +  Di,PS)"  + 


The  Jacobian  of  the  source  terms  is  an  iV  x  iV  matrix,  where  N  is  the  number  of 
transport  equations.  Equation  45  therefore  represents  a  coupled  system  of  N  linear 
algebraic  equations  which  must  be  solved  for  each  finite-volume  cell  on  each  time 
step.  The  resulting  method  is  called  a  block  point  implicit  method  since  it  requires 
the  solutions  of  linear  systems  with  N  x  N  block  matrices.  If  the  number  of  species 
is  large,  the  solutions  of  these  systems  can  become  the  most  expensive  part  of  the 
procedure. 

The  cost  of  the  block  point  implicit  method  may  be  reduced  somewhat  by  taking 
advantage  of  the  natural  sparseness  within  the  source  Jaeobian  matrix.  In  particular, 
four  rows  of  this  matrix,  those  corresponding  to  the  momentum  and  total  energy 
equations,  are  always  zero.  For  example,  for  laminar  thermochemical  nonequilibrium 
flows 
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Utilizing  the  sparseness  of  this  matrix,  the  cost  of  solving  the  linear  system  of  equation 
45  is  little  more  than  the  cost  of  solving  an  {NS  -f  l)x{NS  -f  1)  linear  system. 

The  cost  of  solving  the  linear  system  may  be  further  reduced  by  using  atom 
conservation  equations  to  eliminate  additional  lines  of  the  matrix.  For  example.  Park 
and  Yoon  [34]  used  the  conservation  of  oxygen  and  nitrogen  atoms  to  reduce  the 
size  of  the  linear  system  for  non-ionized  air  by  two  equations.  This  reduced  the 
size  of  the  system  from  6i6  to  4x4,  thereby  reducing  the  computational  effort  for 
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the  system  solution  by  two  thirds:  from  191  floating  point  operations  to  62  floating 
point  operations.  This  approach  to  matrix  reduction  could,  with  some  difficulty,  be 
extended  to  more  complex  reactions  such  as  Parks  ionizing  air  reaction  model  [11]  or 
Jachimowski’s  hydrogen  air  combustion  model  [35].  Unfortunately,  the  savings  is  not 
as  great  for  models  with  a  large  number  of  species  (ionizing  air)  as  it  is  for  models 
with  few  species  (non-ionizing  air).  Furthermore,  this  approach  is  very  difficult  to 
implement  with  general  user-specified  chemistry.  The  code  therefore  uses  a  different, 
more  general,  approach  to  reducing  the  cost  of  solving  the  linear  system. 

Two  things  have  been  done  to  improve  the  efficiency  of  the  block  point  implicit 
method.  The  first  is  to  automatically  reduce  the  reaction  rates  when  they  are  exces¬ 
sive.  The  second  is  to  reduce  the  cost  of  the  system  solution  by  replacing  the  source 
Jacobian  matrix  with  a  diagonal  approximation  to  the  source  Jacobian  matrix. 

The  stiffness  is  greatly  reduced  by  limiting  the  reaction  rates  so  that  the  cell 
Damkohler  number  does  not  get  too  large.  The  cell  Damkohler  number,  D^,  is  the 
ratio  of  the  characteristic  flow  time  (time  for  flow  to  cross  cell)  to  the  characteristic 
reaction  time  (time  for  reaction  to  be  1/e  completed). 
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If  Da  is  much  larger  than  unity  the  model  is  attempting  to  complete  the  reaction  in 
less  time  than  it  takes  for  the  flow  to  cross  the  cell.  In  steady  state  flows  this  means 
that  the  model  will  attempt  to  complete  the  reaction  in  less  than  a  cell  width.  We 
cannot  accurately  calculate  any  flow  feature  in  less  than  2  cell  widths  so  there  is  no 
advantage  in  letting  the  reactions  progress  that  fast.  In  cases  where  the  reaction  rates 
calculated  from  the  .Arrhenius  equations  give  Da  greater  than  a  user  specified  value, 
Damai'  we  simply  reset  the  reaction  rates  so  that  Da  equals  the  Da„,ax-  The  backward 
rates  are  modified  by  the  same  factor  as  the  forward  rates  so  that  the  equilibrium 
solution  is  unchanged.  Numerical  studies  have  shown  that  there  is  no  effect  on  the 
accuracy  of  the  final  solution  if  iJamox  >  1^. 

The  second  modification  of  the  block  point  implicit  method  is  to  replace  the  true 
Jacobian  of  the  species  source  terms  with  an  approximate  diagonal  Jacobian  [55]. 
With  a  diagonal  Jacobian,  the  solution  of  the  linear  system  given  by  equation  45 
reduces  to  .V  scalar  divisions;  very  inexpensive  compared  to  the  solution  of  even  the 
reduced  block  linear  system.  Two  approximate  forms  for  the  source  Jacobian  were 
studied.  The  first  approach  uses  the  spectral  radius  of  the  true  source  Jacobian  as  the 
diagonal  term  of  the  approximate  source  Jacobian.  Finding  the  eigenvalues  of  can 
be  difficult  so  we  actually  use  the  L2-norm  of  the  matrix.  This  gives  a  conservative 
estimate  of  the  spectral  radius. 

The  above  approach  is  equivalent  to  rescaling  the  time  step  for  the  species  trans¬ 
port  equations  so  that  the  fastest  reaction  is  explicitly  stable.  This  is  stable  but  may 
converge  slowly  since,  when  fast  reactions  are  present,  it  can  dramatically  slow  down 
the  convection  of  mass.  This  is  particularly  true  when  one  reaction  is  much  faster 
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than  the  others  but  does  not  involve  any  of  the  dominant  species.  In  these  cases  it  is 
much  more  efficient  to  rescale  the  time  steps  for  each  species  independently. 

The  second  approach  bases  the  diagonal  term  for  each  species  continuity  equation 
on  only  the  reactions  affecting  that  species.  This  is  done  by  replacing  the  diagonal 
elements  of  the  source  Jacobian  matrix  by  the  L2-norm  of  the  elements  along  the  row. 
This  approach  converges  quicker  than  the  uniform  rescaling  but  it  does  not  insure 
atom  conservation  in  the  unsteady  solution.  Atoms  are  conserved  in  the  steady  state 
solution  however.  This  latter  approach  is  currently  implemented  here. 


5.1.5  Implicit  Treatment  —  LU-SGS 

The  flux  vector  is  a  function  of  the  solution  in  nearby  cells.  Consider  only  the  2  +  1/2 
surface. 

■P‘‘5i+l/2  =  (47) 

The  flux  is  evaluated  at  the  new  (unknown)  time  level  and  linearized  using  Yoon’s 
approximate  Jacobians. 


^■■^1+1/2  ~  ■^■•^1+1/2  +  ^t+l/2,j,k^^i,j,k  +  ^i+l/2,],k^^i+l,j,k 
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^i  +  \/2,j,k  ~  2  ~  P(-^«+1.2.*;))  ) 

A  is  the  flux  Jacobian,  and  p(A)  is  the  spectral  radius  (maximum  eigenvalue)  of  A. 
Substituting  equation  48  and  the  corresponding  flux  for  the  j—  and  direction  faces 
into  the  discrete  conservation  equation,  equation  32,  yields  the  implicit  finite-volume 
equation.  This  equation  may  be  written  as  three  steps; 

1.  Calculate  the  residuals  for  each  cell  using  an  explicit  flux  balance. 

=  -(D,PS+  D.P-S  +  DkP-Sr  +  N;^^^^,  (49) 

The  residuals  will  approach  zero  as  the  solution  approaches  steady  state. 

2.  Calculate  the  change  in  the  solution  by  solving  the  block-linear  system  of  alge¬ 
braic  equations.  The  rows  of  this  system  are  the  block-linear  algebraic  relations 
resulting  from  the  implicit  flux  balance  applied  to  each  cell. 

+  Di  ^^t.j,k  +  ^i+i/2,j,k  ^Ui+l.j,k] 

+  Dj  {P,^j^i/2.k  "h  ^,,j+l/2.fc  ^^i,]  +  l.k] 

+  Dk  ^^i,J,k  +  ^t,j,k+l/2  ~  ^i,J,k 
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Because  of  the  choice  of  Jacobians,  this  resulting  block  matrix  multiplying  SUij^k 
is  approximately  diagonal  when  the  source  Jacobian,  is  zero. 

3.  Update  the  solution  using 


where  r  is  a  relaxation  factor.  The  code  starts  with  a  user  specified  r  near  one 
and  updates  the  solution.  If  the  solution  at  time  level  n  -f-  1  results  in  negative 
pressures  or  temperatures  the  relaxation  factor  is  halved  and  the  solution  at  n+1 
is  recalculated.  This  process  is  continued  until  all  pressures  and  temperatures 
are  non-zero  or  r  reaches  a  specified  minimum. 

The  above  algorithm  consists  of  three  steps  which  must  be  done  on  each  time 
step.  The  first  step  is  an  explicit  step  which  calculates  the  residual  from  the  local 
variation  of  the  solution.  The  second  step  implicitly  calculates  the  change  in  the 
solution  according  to  global  variations  in  the  residual.  The  last  step  updates  the 
solution. 

The  second  step  is  the  most  difficult  because  it  requires  the  solution  of  a  large 
sparse  system  of  linear  equations.  The  LU-SGS  algorithm  approximately  solves  this 
system  using  two  sweeps  of  a  point  Gauss-Seidel  relaxation.  Assume  the  iteration 
sweeps  first  in  the  direction  of  increasing  i,j,k  indices  and  then  in  the  direction  of 
decreasing  i.j.k  indices.  The  increasing  ij,k  iteration  is  performed  by  applying  the 
ecjuation 
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.it  (-acli  internal  cell.  The  decreasing  i.j.k  iteration  is  performed  by  applying  the 
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This  scheme  may  be  written  as  an  approximate  lower-upper  (LU)  factorization  by 
subtracting  equation  ol  from  equation  52  and  rearranging  to  remove 
h'ft  hand  side. 
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This  equation  replaces  equation  52.  It  yields  the  same  results  as  equation  52  but 
requires  fewer  floating  point  operations  and  less  memory.  The  diagonal  wavefront  al¬ 
gorithm  is  used  in  the  solution  of  the  Lower  and  Upper  systems  to  allow  vectorization 
or  parallelization  (see  section  3). 

When  the  source  Jacobian,  |^,  is  zero  the  LU-SGS  scheme  is  very  efficient  because 
the  main  block  (i.e.  the  matrix  multiplying  is  diagonal  and  easy  to  invert. 

When  there  are  chemical  reactions  the  source  Jacobian,  given  by  equation  46,  is  not 
zero.  Therefore,  the  resulting  matrix  is  not  diagonal  and  would  be  more  expensive  to 
invert.  We  therefore  use  the  approximate  source  Jacobians  discussed  in  the  previous 
section  to  diagonalize  the  main  block  for  chemically  reacting  flows. 


5.2  Boundary  Conditions 

The  various  boundary  condition  options  available  in  the  code  are  described  in  section 
4.4.  These  boundary  conditions  are  in  the  form  of  algebraic  or  differential  relation¬ 
ships  for  the  flow  variables  on  the  boundary.  The  boundary  conditions  are  satisfied 
numerically  by  setting  the  values  of  the  flow  variables  in  the  boundary  cells  so  that 
the  boundary  conditions  are  satisfied  on  the  surface  between  the  boundary  cell  and 
the  first  internal  cell. 

5.2.1  Free-Slip  Walls 

The  zero  gradient  in  pressure  and  temperature  boundary  conditions  are  easy  to  im¬ 
plement.  The  pressure  and  temperature  within  the  boundary  cell  is  simply  set  to  the 
values  in  the  first  internal  cell.  The  impermeable  wall  condition  is  set  by  reflecting 
the  velocity  vector  about  the  surface.  Mathematically, 

Tboundary  cell  ~  i  n 

V  /  tniernal  cell 

where  V  is  the  velocity  vector  and  n  is  the  unit  surface  normal  vector.  The  average 
velocity  at  the  surface  is  therefore  tangent  to  the  surface. 

5.2.2  No-Slip  Walls 

For  no-slip  walls  the  easy  boundary  conditions  are  for  velocity  and  pressure.  The 
velocity  in  the  boundary  cell  is  set  equal  to  the  negative  of  the  velocity  in  the  interior 
cell  so  that  the  velocity  at  the  wall  is  zero.  The  pressure  in  the  boundary  cell  is  equal 
to  the  value  in  the  first  internal  cell.  The  wall  temperature  depends  on  whether  the 
wall  is  adiabatic,  isothermal,  or  radiating/conducting. 
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If  the  wall  is  adiabatic  (and  nonradiating),  the  temperature  gradient  in  the  gas  at 
the  wall  is  zero.  This  is  satisfied  by  setting  the  boundary  cell  temperature  equal  to 
the  temperature  in  the  first  internal  cell. 

If  the  wall  is  isothermal,  the  temperature  at  the  wall  is  specified.  This  is  satisfied 
by  extrapolating  the  temperature  from  the  internal  cell  and  wall  into  the  boundary 
cell.  Then  when  the  boundary  cell  and  internal  cell  temperatures  are  averaged  the  de¬ 
sired  wall  temperature  is  obtained.  The  boundary  cell  species  densities  are  calculated 
from  the  temperature,  pressure,  and  zero  gradient  in  species  mass  fractions. 

The  most  complicated  case  is  when  the  radiating/conducting  wall  option  is  chosen. 
In  this  case  the  wall  temperature  is  given  by  the  complicated  expression,  equation  20. 
For  numerical  reasons  it  is  often  desirable  to  slow  down  the  change  in  wall  temperature 
during  the  initial  transients,  so  we  add  a  heat  capacity  (per  unit  volume),  c^c,  to  the 
wall.  For  simplicity,  the  heat  capacity  is  lumped  at  the  surface  so  that  the  expression 
for  the  heat  conduction  into  the  wall  doesn’t  change.  The  resulting  expression  for  the 
wall  temperature  is 

dT^ 

('hc^wall  ~  9^  T  9*  T 
or 

i  ,  dr  ,  _  r.  ^4  .  ^4i 

Chc^wali  -I  j  ^  (b4) 

d/  an  ^  t^aii  '■ 

The  above  expression  is  an  ordinary  differential  equation  that  must  be  solved  at  each 
wall  surface  at  each  time  step.  The  equation  is  highly  nonlinear  and  very  stiff  because 
of  the  T*  term  in  the  radiative  heat  flux.  The  equation  is  solved  using  a  linearized 
fully  implicit  procedure. 
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When  the  wall  capacity,  c^c-  is  set  to  zero  this  is  equivalent  to  doing  one  Newton 
iteration  on  the  nonlinear  equation  per  flow  solver  time  step. 


5.2.3  Rarefied  Gas  Slip  Walls 

In  section  4.4.3  expressions  were  given  for  the  velocity  at  the  edge  of  the  Knudsen  laj^er 
and  temperature  jump  across  the  Knudsen  layer.  The  actual  boundary  conditions  for 
the  .Xavier-Stokes  solver  are  the  values  of  the  flow  variables  at  the  edge  of  the  Knudsen 
layer,  as  indicated  by  the  subscript  s,  not  the  values  of  the  flow  variables  at  the  wall. 
In  this  section  we  describe  the  numerical  treatment  of  these  boundary  conditions. 

The  first  step  is  to  find  the  temperature  at  the  edge  of  the  Knudsen  layer,  Jj,.  This 
temperature  depends  on  the  nature  of  the  boundary.  If  the  wall  is  adiabatic  with  no 
radiation,  equation  24  holds  and  the  temperature  gradient  is  simply  related  to  the  slip 
surface  velocities  and  velocity  gradients.  The  velocities  and  gradients  are  evaluated 
from  the  velocities  at  the  previous  time  step  and  the  desired  temperature  gradient  is 
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obtained  by  setting  the  boundary  cell  temperature  appropriately.  For  example,  if  the 
surface  is  the  j  =  5/2  boundary,  then 


Ti,2,k  =  ^  [(«'., 3, fc  +  Wi,2,fe)(«i,3,fc  -  “<.2.fc)  +  +  «^i,2,A:)«3,fc  “  «^!,2,t-)]  . 


where  u'  and  w'  are  two  orthogonal  components  of  the  surface  tangent  velocity. 

If  the  wall  is  either  isothermal  or  radiating/conducting,  T,  must  be  calculated 
using  the  temperature  jump  relation  given  in  equation  25.  The  equation  is  rewritten 
as  a  function  of  the  temperature  ratio  whose  value  should  be  zero. 


This  nonlinear  equation  is  solved  using  a  Newton  iteration. 


Currently,  one  iteration  of  the  Newton  iteration  is  performed  per  time  step  so  that 
the  equation  is  not  solved  completely  on  each  time  step,  but  is  accurately  solved  when 
the  solution  is  at  steady  state. 

When  the  wall  is  isothermal,  the  wall  temperature  doesn’t  change  and  the 
above  procedure  is  used  to  calculate  T,,  the  temperature  at  the  edge  of  the  Knudsen 
layer.  The  temperature  in  the  boundary  cell  is  then  determined  by  extrapolation. 
If  the  wall  is  radiating  and/or  there  is  heat  conduction  into  the  wall,  the  wall  tem¬ 
perature  is  given  by  equation  26.  In  this  case,  equations  25  and  26  are  a  pair  of 
coupled  nonlinear  algebraic  equations  for  the  T,  and  T^.  The  first  equation  is  solved, 
as  described  above,  using  one  step  of  a  Newton  iteration.  For  the  solution  of  this 
equation  the  value  of  T^,  from  the  previous  time  step  is  used.  The  second  equation 
is  then  solved  for  as  described  in  the  previous  section  while  Msuming  that  ^ 
is  constant.  This  procedure  converges  to  the  appropriate  wall  temperature  and  slip 
surface  temperature  at  steady  state. 

The  next  step  is  to  calculate  the  velocities  at  the  edge  of  the  Knudsen  layer.  These 
velocities  are  given  by  equations  21  through  23.  The  velocities  are  first  transformed 
into  the  orthogonal  surface  normal  coordinate  system  {x\y\z')  where  y'  is  the  co¬ 
ordinate  normal  to  the  surface.  The  derivatives  in  equations  21  and  23  are  then 
differenced.  For  example,  for  the  j  =  5/2  boundary  this  yields 
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The  resulting  algebraic  expressions  are  solved  for  u'  and  w'^  to  give 

c  . 


u'  = 


w.  = 


— —  U  'i 

l  +  C 

c  , 


- W  r. 

1  +  C 


where 


f¥2  -  0 

yl  0 


NS 


!=1 
5  J 


riie  slip  surface  velocities  are  then  transformed  back  to  the  global  {x,y,  c)  coordinate 
s\  steni  and  the  boundary  cell  values  of  velocity  are  determined  by  extrapolation. 

The  boundary  cell  values  of  pressure  and  species  density  are  calculated  as  de¬ 
scribed  in  the  previous  section. 


5.2.4  Supersonic  Inflow 

The  supersonic  inflow  condition  is  satisfied  by  setting  the  flow  variables  in  the  bound¬ 
ary  cells  to  the  values  specified  by  the  user. 

5.2.5  Supersonic  Outflow 

riie  supersonic  outflow  condition  is  satisfied  by  setting  the  flow  variables  in  the  bound¬ 
ary  cells  equal  to  the  flow  variables  in  the  neighboring  internal  cell. 


5.3 


6  Parallel  Navier- Stokes  Implementation 

The  3-D  Navier-Stokes  code  “HAN A”  has  been  parallelized  on  the  Intel  Touchstone 
iPSC/S60,  Cray  Y-MP,  and  Silicon  Graphics  Iris  4-D  machines  using  domain  decom¬ 
position.  This  development  is  under  tasks  5  through  10  as  described  in  section  2.  The 
approach  is  essentially  the  same  as  that  taken  for  the  model  code  described  previously 
in  section  3. 

The  Navier-Stokes  equations  are  solved  using  Lower-Upper  Successive  Gauss- 
Seidel  relaxation  (LU-SGS).  The  procedure  uses  a  diagonal  wavefront  algorithm  in 
which  the  inner  most  loop  is  over  all  points  on  the  ^  -t-  j  +  k  constant  plane  (see 
Figure  15).  .411  points  on  this  plane  are  independent  of  the  other  points  on  the  plane 
so  that  they  may  all  be  updated  simultaneously.  This  allows  the  code  to  vectorize 
over  the  inner  loop. 

The  structure  of  the  HANA  code  is  shown  in  Figure  16.  It  is  a  multiple-zone 
(multi-block)  code  that  uses  multiple  i,j,  k  ordered  grids  patched  together  at  common 
boundaries.  The  hierarchical  structure  of  the  code  reflects  its  multiple- zone  nature. 
The  main  routine  is  a  driver  routine  which  calls  subroutines  to  operate  on  zones.  The 
main  operations  are  “perform  one  iteration  for  a  zone”  and  “transfer  data  between 
zones".  This  model  was  chosen  because  it  fits  nicely  with  the  data  partitioning  and 
interprocessor  communication  approach  required  on  parallel  computers. 

The  following  subsections  present  the  specific  implementation  of  the  domain  de¬ 
composition  technique.  Descriptions  of  the  demonstration  cases  and  timings  on  a 
single  epu  Cray  Y-MP  computer  are  presented  in  the  following  section.  Finally,  com¬ 
parisons  are  presented  for  the  parallel  efficiency,  speed  up  and  performance  of  pHAN.A 
on  the  different  parallel  machines. 


6.1  Distributed  Memory  MIMD  Computer 

The  iPSC/SGO  version  of  the  pHANA  code  requires  a  host  program  and  a  node  pro¬ 
gram  to  run.  Due  to  the  computationally  demanding  nature  of  the  Navier-Stokes 
equations  (especially  for  chemically  reacting  flows),  both  programs  must  be  written 
in  double  precision  format.  The  host  program,  which  is  usually  run  on  the  system 
resource  manager  (SRM),  allocates  a  cube  of  nodes  (processor- memory  pairs),  i'he 
host  program  also  requires  information  on  how  each  of  the  zones  is  divided  in  the  i- 
and  j-directions.  the  number  of  processors  used,  the  name  of  the  cube,  the  name  and 
location  of  the  node  program,  and  a  flag  to  tell  the  node  program  whether  it  is  a  new 
or  restart  calculation.  The  host  program  allows  a  cube  to  be  allocated  and  released 
automatically  when  the  program  is  running.  It  also  allows  the  user  to  load  a  node 
program  from  the  concurrent  file  system  (cfs)  by  specifying  the  full  path  name  of  the 
node  program.  The  Navier-Stokes  solver  is  the  node  program  which  has  to  be  loaded 
onto  the  nodes  by  the  host  program. 

Tlie  node  program  (pHAN.A)  was  parallelized  on  the  distributed  memory  com- 
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outer  using  a  permanent  domain  decomposition  technique.  The  domain  decomposi¬ 
tion  used  in  the  Navier-Stokes  implementation  is  the  same  as  the  domain  decompo¬ 
sition  for  the  model  equation.  The  decomposition  is  limited  to  a  ‘‘two-dimensional” 
decomposition,  to  reduce  the  search  effort  for  interzone  boundary  conditions.  The 
basic  idea  of  domain  decomposition  is  to  divide  the  overall  domain  two-dimensionally 
(i.e.,  in  the  I-direction  and  J-direction  only)  into  subdomains  (subzones)  which  are 
assigned  to  separate  processors.  The  K-direction  is  not  affected  by  the  decomposition. 
The  number  of  subzones  is  equal  to  the  number  of  processors  used  for  the  calculation. 
Tlie  domain  decomposition  is  modified  for  the  Navier-Stokes  code  implementation 
due  to  load  balancing,  faster  i/o.  and  memory  limitations.  The  modifications  in¬ 
clude:  input/output  and  job  management,  initial  conditions,  and  restart  capabilities. 

The  input/output  tasks,  wliich  were  performed  by  the  SRM  in  the  model  code, 
were  moved  to  one  of  the  nodes.  In  addition,  this  input/output  node  (processor)  was 
dedicated  to  perform  all  job  management  activities.  The  reasons  of  this  change  were; 

1 .  Tlie  amount  of  data  that  can  be  transferred  from  the  SRM  to  the  nodes  is  limited 
to  256  Kliytes.  Node  to  node  transfers  do  not  have  this  limitation.  Since  the 
Intel  Touchstone  favors  the  passing  of  a  few  large  blocks  of  data  rather  than 
man\'  small  blocks,  several  variables  are  combined  and  passed  simultaneously. 
This  reduced  the  overh<*ad  in  establishing  interprocessor  communication  links. 

2.  Dedicating  a  node  to  i/o  avoids  the  load  leveling  problems  seen  with  the  model 
<‘quation.  I/o  is  generall}’  performed  R- sequential  code  which  cannot  be  paral¬ 
lelized.  When  one  node  shares  i/o  and  solution  duties,  it  is  much  slower  than 
other  nodes.  Since  the  solutions  on  all  nodes  are  synchronized,  all  other  epus 
have  to  wait  until  the  node  sharing  i/o  and  solution  duties  is  finished.  Using 
a  dedicated  node  for  i/o  and  job  management  duties,  the  solver  epus  ha\e  less 
idle  time. 

The  second  modification  to  the  domain  decomposition  involved  the  initialization 
process.  Previously,  the  mesh  was  input  and  the  variables  were  initialized  on  the  i/o 
node  before  the  domain  was  split  and  distributed  to  the  solver  nodes  Th-  oroces-s  was 
modified  Ix'caiise  of  the  8  megabytes  memory  limitation  of  the  nodes  on  the  standard 
iP.S('/860  machine,  which  leaves  only  5  megabytes  for  data  after  the  executable  is 
loadc'd.  fhis  seriously  limited  the  size  of  the  problem  that  couiu  be  run  on  this 
com: "  ter.  The  modified  initialization  procedure  is; 

1.  The  i/o  node  reads  in  the  original  mesh  from  an  external  mesh  file. 

2.  The  i/o  node  splits  the  mesh  into  subdomains  based  on  the  specified  number  of 
divisions  in  the  I  and  J-diret ' ions. 

2.  .\fter  the  original  mesh  has  been  split  into  subdomains,  the  i/o  node  sends  the 
subdomain  meshes  to  the  appropriate  solver  nodes. 
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4.  The  solver  nodes  now  perform  the  flow  field  initialization  based  on  the  data 
received  from  the  i/o  node. 

By  distributing  the  mesh  before  initializing  the  flow  variables,  the  memory  required 
by  the  i/o  node  is  minimize  and  larger  ca^es  can  be  calculated  on  the  machine. 
However,  the  initial  interprocessor  communication  increased  due  to  the  passing  of  the 
parameters  defining  the  initial  conditions. 

The  third  modification  to  the  domain  decomposition  procedure  was  the  reading 
and  writing  of  the  restart  file.  Due  to  the  memory  hmitation  mentioned  above, 
the  restart  file  is  now  written  in  terms  of  the  decomposed  subdomain  zones.  When 
the  code  is  restarted,  the  i/o  node  allocates  just  enough  memory  (or  heap  space) 
for  one  subdomain,  reads  the  restart  file  for  that  subdomain,  sends  the  data  to  the 
appropriate  node  and  releases  the  memory.  These  operations  are  repeated  until  all  of 
the  subdomains  in  the  restart  file  are  input  and  sent  to  their  nodes.  This  procedure 
allows  H.AN.A^  to  use  all  of  the  available  memory,  where  each  cpu  contributes  about 
five  megabytes  of  usable  memory.  Further,  the  restart  files  are  now  written  in  the 
concurrent  file  s_'  stem  (cfs)  to  speed  up  the  read  and  write  time.  It  appears  that 
access  time  to  a  cfs  restart  file  is  an  order  magnitude  faster  than  access  time  to  a 
restart  file  on  an  SR.M  disk.  However,  the  draw  back  of  writing  the  restart  file  onto 
a  cfs  disk  is  that  it  cannot  be  accessed  interactively. 

.A  schematic  diagram  of  the  domain  decomposition  of  a  simple  two-dimensional 
geometry  is  shown  in  Figure  17.  The  size  of  the  subzones  must  be  small  enough 
to  fit  within  the  available  local  memory  of  the  processor  attached  to  that  subzone. 
Calcidations  are  driven  using  a  global  time  step.  Subzones  are  connected  by  means  of 
interzone  patches  which  provide  communication  between  adjacent  subzones,  shown 
in  Figure  18.  This  communication  between  the  subzones  on  different  nodes  is  done 
by  copying  the  data  from  the  sending  subzone  to  an  intermediate  array  on  the  same 
processor.  This  array  is  then  sent  to  an  intermediate  array  on  the  receiving  processor 
and  subsequently  used  as  boundary  conditions  in  the  corresponding  subzone  on  the 
receiving  piocessor. 


6.2  Shared  Memory  Cray  Y-MP 

riie  i)rocedure  for  parallelizing  on  the  Cray  Y-MP  is  similar  to  the  procedure  used 
for  the  .\avier-Stokes  code  implementation  on  the  Intel  Touchstone  iPSC/860  (i.e.. 
macrotasked  domain  decomposition).  A  given  zone  is  divided  up  automatically  into 
subzones.  The  number  of  subzones  is  equal  to  the  number  of  processors  used  in 
t  he  calculation.  Unlike  the  distributed  memory  MIMD  version  however,  there  is  no 
pioces.sor  dedicated  for  job  management  and  i/o  purposes.  One  of  the  processors  is 
burdened  with  the  job  management  and  i/o  responsibilities,  in  addition  to  number 
cruncliing.  However,  because  the  CRAY  uses  a  small  number  of  fast  processors,  the 
additional  work  load  on  this  proce.ssor  is  in  most  cases  negligible. 
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Parallelization  is  performed  using  macrotasking  compiler  directives.  The  conserva¬ 
tive  variable  data  for  each  subzone  is  isolated  and  can  only  be  accessed  the  processor 
which  is  assigned  to  that  subzone  (private  data).  As  with  the  Intel  iPSC/860  the 
communication  between  the  subzones  is  done  using  interzone  patches.  The  bound¬ 
ary  conditions  on  each  subzone  are  copied  into  an  intermediate  array  where  data  is 
shared  with  the  adjacent  processors.  The  macrotasking  parallelization  (large-grain 
parallelization)  is  done  by  isolating  segments  of  the  code  that  can  be  run  in  paral¬ 
lel  into  e.xternal  subrou.ines.  The  job  is  divided  into  tasks  to  be  run  on  different 
cpus  by  executing  multiple  .A-level  solver  routines  with  each  processing  on  a  differ¬ 
ent  subzone.  The  .\-level  solver  routine  is  defined  as  external  in  the  driver  routine. 
To  start  a  task,  the  compiler  directive  TSKSTART  is  used.  Thus.  TSKSTART  is 
invoked  several  times  with  the  same  external  routines  as  arguments,  causing  several 
processors  to  execute  the  solver  routine  in  parallel.  The  memory  which  contains  the 
flat  a  sets  for  each  subzone  is  treated  as  private  data.  Upon  the  completion  of  the 
solver  routine,  the  shared  data  (the  interzone  patch  array)  is  modified  by  copying  the 
Ululated  boundary  data.  The  L‘2-norm  convergence  level  is  also  shared  data  where 
each  processor  contributes  its  subzone  L2-norm.  To  avoid  memory  contention  prob¬ 
lems.  a  T.SKW'.AIT  compiler  directive  is  called  on  each  time  step  prior  to  all  boundary 
condition  updates.  The  boundary  conditions  are  updated  by  unlocking  values  in  the 
shared  data  area. 

This  domain  decomposition  technique  is  the  same  as  the  one  used  on  the  dis- 
tributed  memory  MIMD  computer.  The  transfer  of  data  between  subzones  is  different, 
as  it  is  done  using  shared  memory  rather  than  message  passing  between  processors. 

.A  major  problem  with  the  Cray  ’s  that  it  is  not  possible  to  obtain  accurate 

timings  of  parallel  runs  when  it  is  in  the  multiuser  mode.  The  timings  for  task  3  were 
made  with  the  X.AS  Cray  AMP  in  single-user  mode,  which  required  the  computer  to 
l>e  fledicated  to  our  use.  Clearly,  such  runs  must  be  scheduled  well  in  advance  and 
must  run  very  quickly.  We  were  unable  to  get  dedicated  time  on  the  NAS  Cray  A^MP 
for  timing  of  Navier-Stokes  runs. 

■Another  problem  with  the  parallelized  code  on  the  Cray  is  that  the  domain  de¬ 
composition  reduces  the  v'ector  lengths  and.  therefore,  the  vector  efficiencies.  For 
example,  a  zone  which  has  an  I-dimension  of  80  may  be  divided  into  two  zones  with 
I-dimensions  of  40.  The  zone  with  a  dimension  of  SO  has  a  longer  vector  length  than 
the  zones  with  dimension  40  and  will  therefore  take  less  than  twice  as  long  to  update. 

1  he  net  result  is  in  effect  a  reduction  in  parallel  efficiency  even  without  memory  con¬ 
tention  or  load  leveling  problems.  This  problem  may  be  avoided  by  increasing  the 
probhmi  size  as  the  number  of  epu  increa.ses. 

6.3  Shared  Memory  Silicon  Graphics  Iris 

I  lu'  proceflure  for  parallelizing  on  the  Silicon  Graphics  Iris  4-D  (SGI)  is  very  similar 
to  the  i^rocedure  used  for  the  Intel  Touchstone  iPSC/S60  and  the  Cray  A^-MP  (i.e.. 


macrotasked  domain  decomposition).  Based  on  the  number  of  user-specified  divisions 
in  the  i—  and  j— directions,  a  given  zone  is  divided  up  automatically  into  subzones. 
No  host  program  is  necessary  in  this  case,  and  one  processor  is  responsible  for  the 
job  management  and  i/o  operations,  in  addition  to  number  crunching.  This  does 
not  cause  a  significant  load  leveling  problem  unless  the  number  of  grid  points  is  very 
small. 

The  SGI  is  a  shared  memory  .MI.MD  type  computer.  The  compiler  parallelizes  by 
generating  threads  (processes)  from  DO  loop  iterations.  The  DO  loop  is  the  basic 
work  unit  that  is  split  into  concurrent  threads.  Since  parallelization  on  this  machine 
is  limited  only  to  DO  loop  parallelization,  the  original  thread  is  the  master,  and  it 
controls  the  execution  of  all  other  threads.  By  splitting  the  A-level  call  loops  in  the 
driver  routine,  the  code  is  executed  in  parallel.  The  CSDOACROSS  compiler  directive 
is  used  for  the  DO  loop  over  the  solver  routine.  The  subzone  number  is  the  variable 
incremented  in  the  DO  loop.  It  is  defined  as  a  LASTLOCAL  variable  so  that  the 
appropriate  values  are  sent  to  the  slave  threads.  The  time  step  number,  time,  and 
convergence  level  are  defined  to  be  shared.  Data  for  each  subzone  is  isolated  and  can 
only  be  accessed  by  one  thread,  thus  preventing  memory  access  collisions. 

The  transfer  of  data  between  subzones  on  different  processors  is  through  shared 
memory  locations.  The  interzone  communication  is  done  by  copying  boundary  condi¬ 
tion  data  into  an  intermediate  array.  Data  in  this  array  is  locked  until  all  processors 
finish  the  time  step.  This  is  synchronized  such  that  the  boundary  conditions  are  not 
updated  and  processors  cannot  start  working  on  the  next  time  step  until  all  proces¬ 
sors  are  finished  working  on  the  current  time  step.  If  the  subdivision  of  the  zones 
has  a  load  balancing  problem,  or  if  other  processes  delay  the  execution  of  one  of  the 
threads,  this  synchronization  will  significantly  reduce  the  parallel  efficiency.  However, 
synchronization  is  necessary  to  prevent  a  memory  collision  which  could  easily  happen 
in  a  shared  memory  system.  .After  the  time  step  is  finished  in  all  subzones,  data  in  the 
intermediate  arrays  is  available  to  complete  the  transfer  of  data  between  subzones. 


Figure  16;  HANA  code  subroutine  hierarchical  levels 


Figure  17:  Domain  Decomposition 


Figure  18:  In  ter  processor  Communication 


7  Test  Cases 


During  tasks  6,  8,  10,  and  11  we  performed  a  suite  of  nine  engineering  calculations 
on  three  different  computers  using  the  parallelized  Navier-Stokes  code.  The  nine  test 
cases  are  listed  in  Table  2  and  are  presented  in  this  section.  All  cpu  times  given  in 
this  section  are  the  single  cpu  Cray  Y-MP  time.  Timings  on  other  computers  scale 
accordingly  and  the  comparisons  are  presented  in  the  next  section. 


Case  .\o.  Description 

1.  RAE-2S22  airfoil 

2.  Unsteady  flow  past  a  cylinder 

3.  Mars  penetrator  with  CO2  chemistry  model 

4.  Hydrogen-air  shock  induced  combustion 

5.  Premixed  hydrogen-air  past  a  10°  ramp 

6.  Hydrogen-air  oblique  detonation  ram  accelerator 

7.  Methane-air  shock  induced  combustion 

S.  Methane-air  thermally  choked  and  transdetonative  ram  accelerator 

9.  Ideal  gas  3-D  ram  accelerator 


Table  2;  Suite  of  Test  Cases 


7.1  RAE-2822  Airfoil 

The  first  test  case  is  a  2D  flow  field  that  is  representative  of  a  class  of  viscous-shock 
flow  interactions  commonly  found  on  transonic  vehicles.  A  two-zone  body  fitted  C- 
mcsh  consisting  of  104  x  44  and  104  x  44  points  was  used  in  the  calculation.  The 
24  inch  chord  RAE-2S22  airfoil  has  a  sharp  trailing  edge,  a  moderate  amount  of  aft 
chamber,  and  a  12.1%  thick  supercritical  section.  This  airfoil  has  been  widely  used 
for  c  ode  validations.  The  case  computed  is  a  Mach  0.73  flow  at  a  3.19°  angle  of  attack. 
.\  converged  solution  was  adiicved  within  2000  iterations  requiring  approximately  30 
minutes  on  a  Cray  Y-MP.  The  mesh,  pressure  contours  and  the  pressure  coefficient 
from  the  computed  solution  are  shown  in  Figures  19.  20,  and  21. 

The  computed  solution  shows  a  very  good  comparison  with  the  experimental  re¬ 
sults.  .A  discrepancy  in  the  pressure  coefficient  is  due  to  neglect  of  turbulence.  Ex- 
perimentally.  a  boundary-layer  trip  strip  was  placed  at  3%i  chord. 

7.2  Unsteady  flow  past  a  cylinder 

The  second  test  case  is  an  unsteady  .Mach  0.45  flow  past  a  cylinder  at  a  Reynolds 
number  of  110.000.  The  case  is  selected  to  demonstrate  pHANA’s  unsteady  flow 
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analysis  capability.  A  three-zone  mesh  consisting  of  40  x40,  40  x40,  and  40  x96  points 
was  used  in  the  calculation.  The  calculation  was  performed  using  an  explicit  method 
and  was  carried  over  to  10,000  time  steps  taking  approximately  20  cpu  minutes. 
Streamlines  of  the  results  at  different  times  are  shown  in  Figure  22. 

7.3  Mars  penetrator  with  CO2  chemistry  model 

The  third  case  is  a  calculation  of  hypersonic  flow  of  Martian  atmospheric  gas  over  an 
axisymmetric  aerobrake.  The  geometry  for  this  vehicle  is  shown  in  Figure  23.  The 
calculation  was  done  using  a  50  x  60  single  zone  mesh  which  ignores  the  afterbody  of 
the  vehicle.  The  freestream  conditions  are  P  =  7.36N/m^,  T  =  149“ A',  U  =  47S3m/s, 
and  \I  =  24.  The  freestream  chemical  mass  fractions  for  CO2  and  are  0.9685  and 
0.0315  respectively.  A  1000“ A'  isothermal  no- slip  wall  boundary  condition  is  used  on 
the  surface. 

The  chemistry  model  used  includes  8  species  and  12  reactions  [64],  as  shown  in 
Table  5.  The  temperature  used  in  the  Arrhenius  rate  expression  is  the  geometric 
mean  of  the  translational  and  vibrational  temperatures,  T  =  \/TTy.  The  Landau- 
Teller  coefficients  for  the  vibration/electronic  energy  source  term  are  also  taken  from 
[64].  .At  these  conditions,  the  effect  of  ionization  is  negligible  so  it  was  neglected  in 
the  chemistry  model.  A  converged  solution  was  obtained  in  6000  iterations  requiring 
about  1.0  Cray  Y-MP  cpu  hour. 

The  computed  temperature  contours  for  this  calculation  is  shown  in  Figure  24. 
The  flow  is  energetic  enough  to  cause  significant  decomposition  of  CO2  but  much  less 
dissociation  of  N2.  As  expected,  the  temperature  peak  behind  the  shock  is  visible  all 
along  the  aerobrake  surface. 


7.4  Hydrogen-air  shock  induced  combustion 

The  fourth  test  case  is  the  shock-induced  combustion  of  a  Hydrogen-air  mixture. 
Shock-induced  combustion  phenomena,  ranging  from  decoupled  shock-deflagration 
systems  to  overdriven  oblique  detonation  waves,  were  experimentally  investigated  in 
the  mid  1960’s  and  early  1970’s.  Experiments  mostly  consisted  of  firing  ballistic  and 
other  blunt  body  projectiles  through  explosive  mixtures  such  as  H2IO2  and  //2/Air. 
Calculations  of  this  type  of  reaction  are  very  demanding  in  terms  of  numerical  ro¬ 
bustness  and  accuracy,  since  the  reactions  usually  occur  very  fast  with  significant 
energy  release  which  takes  place  in  a  very  short  distance.  Two  results  from  Lehr’s 
[48]  experiment  were  modeled  numerically.  The  two  cases  involve  a  15mm  sphere- 
cylinder  shaped  projectile  moving  through  a  stoichiometric  mixture  of  hydrogen-air 
(Ex,  =  42662/^0.  Too  =  250“/\ )  at  velocities  of  2605  m/s  and  1685  m/s,  respectively. 
Inviscid  calculations  were  performed  on  a  72  x  65  grid  Converged  solutions  were 
obtained  within  4000  iterations  or  about  40  cpu  minutes.  The  combustion  model 
used  is  listed  in  Table  6.  It  includes  6  reacting  species  {H2,02,  H2O,  H,0,  and  OH) 
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Figure  23:  Geometry  of  the  Mars  Pcnetrator 


Figure  24:  Temperature  contours  for  the  Mars  penetrator 
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Figure  25;  Hydrogen-Air  Shock  Induced  Combustion 

and  1  inert  species  {N2)  and  is  similar  to  the  model  we  used  in  previous  calcula¬ 
tions  [71,  72.  73],  and  the  model  used  by  Yungster  [66,  67,  68]. 

Experimentally,  Lehr  found  that  the  superdetonative  case  resulted  in  a  coupled 
shock-deflagration  system,  whereas  the  subdetonative  case  yielded  a  decoupled  shock- 
deflagration  system.  These  two  cases  were  also  modeled  numerically  by  [66]  and  [72]. 
Figures  25a  and  25b  show  the  computed  temperature  contours  for  the  two  caises 
along  with  experimental  shadowgraph  from  [48].  The  results  correctly  predict  the 
location  and  the  shape  of  the  wave.  In  the  superdetonative  case,  it  appears  that  the 
combustion  front  decouples  from  the  shock  wave  around  the  shoulder  of  the  projectile. 
This  phenomenon  was  also  seen  in  Lehr’s  experimental  shadowgraph.  Unlike  the 
superdetonative  case,  the  combustion  front  in  the  subdetonative  case  is  decoupled 
from  the  bow  shock  wave  and  is  separated  by  an  induction  region.  The  temperature 
appears  to  be  constant  in  the  induction  region  and  is  followed  by  a  sharp  rise  at  the 
combustion  front. 

7.5  Premixed  hydrogen-air  past  a  10°  ramp 

In  the  fifth  test  case  the  HAN  A  calculations  are  compared  with  calculations  presented 
by  Chitsomboon  et.  al  [77]  who  used  the  Roger  and  Chinitz  two-step  combustion 
model.  The  test  case  involves  a  supersonic  premixed  stoichiometric  hydrogen-air 
mixture  which  is  ignited  by  the  boundary  layer  and  a  shock  wave  induced  by  a  10° 
compression  ramp.  The  M  =  4  flow  wets  initially  at  900° A'  and  1  atmosphere.  The 
same  combustion  model  as  in  the  previous  case  was  used  in  this  calculation.  The  shock 
and  the  boundary  layer  raise  the  temperature  and  start  the  combustion  process.  The 
calculation  was  performed  using  a  80  x  .50  grid  requiring  .5000  iterations  and  45  cpu 
minutes.  Results  are  compared  in  Figures  26a  through  Figures  26c.  The  HAN  A 
results  compare  favorably  with  the  results  of  Chitsomboon.  However,  our  results 
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Figure  26;  Hydrogen-Air  Past  a  10°Ramp 

show  significantly  larger  H2O  and  smaller  OH  mass  fractions  than  their  results.  The 
differences  are  due  to  the  different  chemistry  models  used.  Our  model  yields  somewhat 
faster  and  more  complete  reactions  than  the  global  two-step-reaction  of  Roger  and 
Chinitz.  The  same  behavior  was  also  observed  by  Shuen  and  Yoon  [70]. 

7.6  Hydrogen-air  oblique  detonation  ram  accelerator 

The  sixth  test  case  involves  the  superdetonative  operation  mode  of  a  ram  accelerator 
in  a  stoichiometric  hydrogen-air  mixture.  Results  from  the  calculation  are  compared 
to  Yungster's  calculation  [67].  The  geometry  is  a  15  cm  long  projectile  with  a  14° 
cone  half  angle  and  a  1.95  cm  maximum  diameter  inside  a  3.0  cm  diameter  tube. 
The  projectile  is  moving  through  a  300°K,  1  atmosphere,  stoichiometric  hydrogen- 
air  mixture,  and  its  wall  is  rissumed  to  be  isotherm^d  at  600°K.  The  geometry  and 
the  flow  parameters  are  similar  to  those  used  by  Yungster.  However,  information 
was  not  available  for  the  radius  of  the  round  corner  at  the  transition  region  between 
the  nose  and  the  body  of  the  projectile.  A  135  x  51  mesh  and  a  7  species,  8  step 
chemistry  model  was  used  in  the  calculation.  A  converged  laminar  calculation  is 
achieved  within  8000  iterations  requiring  about  2  cpu  hours.  The  three  different  Mach 
numbers  considered  are:  M  =  6.7, 7.5,  and  8.0.  Temperature  contours  for  these  cases 
are  shown  in  Figure  27,  Figure  28,  and  Figure  29  for  M  =  6.7, 7.5, 8.0,  respectively. 
Water  vapor  mass  fraction  contours  for  these  C2tses  are  shown  in  Figure  30.  Figure  31. 
and  Figure  32.  The  pressure  profiles  at  the  projectile  and  the  tube  walls  for  the 
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Figure  33:  .V/  =  6.7:  Pressure  Profile  along  Projectile  Wall 

M  =  6.7  case  are  compared  to  results  taken  from  Yungster's  calculations  in  Figure  33 
and  Figure  34.  .As  previously  computed,  the  combustion  begins  in  the  boundary  layer 
region,  propagates  outwards  to  fill  the  whole  gap  between  the  projectile  wall  and 
the  tube  wall,  and  finally  establishes  a  shock-deflagration  front.  Results  show  good 
agreement  with  Yungster's  calculation.  Small  discrepancies  are  due  to  the  difference 
in  the  expansion  region  at  the  nose/body  junction  of  the  projectile. 

7.7  Methane-air  shock  induced  cr*mbustion 

The  seventh  set  of  test  cases  was  selected  from  the  French-German  Research  Insti¬ 
tute  of  Saint-Louis  (ISL)  shock  tube  experiments  [80].  At  ISL,  the  two  experiments 
described  in  Table  3  were  performed  using  a  stoichiometric  methane-air  mixture  at 
P-^  =  'jlOOOPa.  Too  =  2db°I\.  and  =  2333m/s. 


Case  .\o.  Description 

1.  Blunt  cylinder  with  different  chemistry  models 

2.  _ Cylinders  ct  various  diameters 


Table  3:  .Methane- .Air  Combustion  Cases 

Four  different  chemistry  models  were  used  for  the  calculations  of  the  blunt  cylin¬ 
der.  they  are:  a  '‘full"  model,  a  quasi-global,  a  three-step,  and  a  single-step  global 
combustion  model.  The  full  model  was  derived  from  several  published  methane-rir 
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Figure  34;  .V/  =  6.7:  Pressure  Profile  along  Tube  Wall 

combustion  models.  There  is  significant  variation  in  some  of  the  reaction  rates  cal¬ 
culated  from  the  published  rate  coefficients  for  the  different  models.  The  rate  which 
gave  the  best  solution  for  shock  induced  combustion  on  a  blunt  body  was  chosen.  The 
final  model  consists  of  13  species  and  19  elementary  reaction  steps,  see  Table  7.  This 
model  is  called  full  because  it  contains  no  global  reactions.  It  is,  however,  truncated 
and  a  detailed  sensitivity  study  wcis  not  performed  to  determine  that  all  ignored  re¬ 
actions  arc  insignificant.  Note  that  for  simplicity,  nitrogen  is  assumed  to  be  inert  in 
this  iTiodel. 

The  second  model  tested  was  due  to  Wesbrook  and  Dryer  and  is  called  the  quasi- 
global  reaction  model.  It  includes  10  species  and  12  reaction  steps,  see  Table  S.  The 
model  was  first  used  by  Edelman  and  Fortune  [82],  who  combined  a  single  reaction 
of  fuel  and  oxidizer  with  a  detailed  reaction  mechanism  of  a  CO  —  H2  —  O-i  system. 
It  provides  a  better  equilibrium  temperature  and  a  more  accurate  representation  of 
concentrations  of  the  products  and  reactants  than  the  global  reaction  alone.  The  first 
reaction  in  this  model  is  a  phenomenological  model  and  the  reaction  rate  is  given  in 
the  form  of 

K,.  =  CT^exp[-ElkT)  [FuelY  [Oxidizer]^ . 

.\  small  modification  was  made  to  INCA  to  accommodate  this  form  such  that 

where  k^rr  is  the  standard  .\rrhenius  form  of  the  reaction  rate.  It  is  important  to  note 
that  after  .some  careful  comparisons  to  the  original  model  listed  in  Westbrook  and 
Dryer  Toj.  reactions  involving  H1O2  and  HO2  were  omitted  without  any  noticeable 
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clifFerences.  Thus,  the  model  was  reduced  from  12  species  and  21  reactions  to  10 
species  and  12  reactions. 

The  last  two  chemistry  models  are  global  models  that  were  developed  by  West¬ 
brook  and  Dryer  [To.  76]  and  used  in  the  ram  accelerator  calculations  by  Nusca  [74, 
81].  Tiiey  are  a  tiiree-step  model  having  7  species  and  a  one-step  model  having  5 
species. 

In  the  first  case,  the  geometi}'  is  a  7  mm  diameter  blunt  c}linder.  Calculations 
w('re  jrerioiiiK'fl  on  a  72  x  6.o  mesh.  Temperature  contours  from  the  laminar  calcu- 
!ali(;n.s  for  all  four  models  are  shown  in  Figures  .Toa  through  Figures  35d.  Both  the 
full  model  and  the  (luasi-global  model  predict  a  coupled  shock-deflagration  system 
iH'ar  the  stagnation  region  which  then  decouples  at  the  shoulder  of  the  cylinder.  The 
two  contour  plots  show  a  remarkable  resemblance  to  the  interferometry  pictures  from 
ISL  [SO].  Fnfortunately.  a  more  quantitative  comparison  to  the  experimental  results 
is  not  possible  at  present.  On  I  he  other  hand,  the  three-step  and  global  models  grossly 
!:nderj)redirt  the  amount  of  combustion. 

Figures  .Toe  and  Tof  show  the  stagnation  line  pressure  and  temperature  compar¬ 
ison  betwe('n  the  full  model  and  the  quasi-global  model.  It  appears  that  the  two 
niodc'ls  compare'  favorably.  However,  the  quasi-global  model  slightly  overpredicts  the 
eciuilibriuin  temperature.  It  is  important  to  note  that  the  full  mode!  used  here  is  a 
significant  simplification  of  the  complete  model  listed  in  [76].  and  that  there  is  much 
uiicertaintv  in  some  of  the  rate  coefficients.  Thus,  the  quasi-global  model  appears 
t(j  \  ield  a  reasonable  result  with  about  20%  saving  in  epu-time  compared  to  the  full 
mcjdel. 

To  detc'rmiiK'  the  effects  of  turbulence  on  the  shock  and  combustion  front  locations, 
t  he  Baldwin-Loma.x  turbidcMice  model  was  used  for  a  calculation  with  the  quasi-global 
chemist  r>  modei.  The  <  omput(‘d  teini><“rature  contours  are  shown  in  Figures  Tog.  The 
'ho<k  clellauration  locaticni  is  not  affected  by  turbulence.  However,  the  turbulence 
apix-ars  to  enhance  the  combustion  |)roc-ess  behind  the  shoulder  of  the  cylinder.  The 
tnrladent  result  is  not  significant ly  diffen'iit  from  the  laminar  solution. 

The  second  case  involves  two-climensional  rods  of  varying  diameter.  The  flow 
fiehl  was  modeled  on  a  72  x  6-o  two-dimensional  blunt  body  mesh.  Results  from  the 
ah  Illations  for  7mm.  Tmm.  1mm.  and  O.-omm  diameters  rods  are  shown  in  Figures  T6a 
through  .'Uicl.  respect i\ely.  .Notice  how  a  coupled  shock  deflagration  system  becomes 
a  decou])led  system  as  the  rod  dianu'ter  is  reduced.  However.  LSL  reported  that  there 
is  a  >harp  ignition  onset  between  the  1mm  and  Tmm  diameter  rods.  This  discrepancy 
( caihl  be  due  to  t  he  inaccuracy  of  t  h<>  cpiasi- global  inodtT  in  1  he  calculation  of  induction 
time.  1  he  position  of  the  pressure  t  rcansduc<'rs  in  the  experiments  is  crucial.  They 
may  have  been  placed  too  tar  away  from  the  rod  itself  to  i>e  abh’  to  identify  the 
deliagration  location  for  all  rod  diainetc'is. 

K\c’ti  though  the  tw(»  ■■calibration"  cases  al)o\e  do  not  show  an  exact  agrr’ernent 
with  th('  experimental  findings,  they  show  a  promising  trf'iid  that  such  a  complicated 
combustion  can  be  mcrdeled  usini;  t h<‘ qiiasi-gh^ljal  chi'mistry  model. 


(a)  Tmm,  (b)3rpjn,  (c)lmin,  (d)0.5min 
Figure  36:  Rods  different  Diameters 


7.8  Methane-air  ram  accelerator 

Tiie  eighth  set  of  test  cases  is  Methane-air  combusting  flow  about  a  ram  accelerator 
operating  in  the  thermally  chocked,  transdetonative,  and  superdetonative  modes. 
Results  are  compared  with  experimental  data  from  the  University  of  Washington, 
Depending  on  the  speed  of  the  projectiles  and  the  Chapman-Jouget  (C-J)  det¬ 
onation  speed  of  the  mixture,  the  ram  accelerator  operation  modes  can  be  divided 
into  three  regimes  [69.  7S];  the  thermally  choked  subsonic  combustion  mode,  the 
transdetonative  mode,  and  the  superdetonative  mode.  The  thermally  choked  mode 
operates  at  speeds  up  to  about  85 /X  of  the  detonation  speed  of  the  mixture.  This 
mode  is  usually  characterized  experimentally  by  a  very  sharp  rise  in  the  tube  pres¬ 
sure  profiles  somewhere  along  the  second  half  of  the  projectile  body.  The  combustion 
usually  occurs  behind  the  projectile,  spreads  to  the  full  tube  diameter,  and  forces 
the  flow  to  thermally  choke  at  some  distance  downstream  from  the  projectile  [69]. 
The  transdetonative  operation  mode  is  usually  observed  when  the  velocity  of  the 
projectile  is  859c  to  1159c  of  the  local  C-J  speed  of  the  mixture  [69].  In  this  regime, 
it  ai^jx’ars  that  projectiles  can  make  a  smooth  transition  from  the  thermally  choked 
mode  1  subdetonative  speed)  to  the  superdetonative  operation  mode  in  a  single  mix¬ 
ture  [691.  The  superdetonative  operation  mode  requires  that  the  projectile  travels 
at  a  minimum  speed  of  1159(  of  the  mixture  C-J  speed  [65].  The  basic  principles 
of  this  operation  mode  are  similar  to  the  oblique  detonation  wave  engine  and  the 
scrarn-acceleratcjr  [79]. 
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Figure  .37:  Ram  .Accelerator  Mesh  and  Geometry 
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In  a  previous  report  [73],  we  concluded  that  by  increasing  the  projectile's  speed, 
the  normal  shock  (which  is  the  characteristic  of  the  thermally  choked  subsonic  com¬ 
bustion  mode)  moves  downstream  on  the  projectile  and  detaches  from  the  projectile's 
body  as  it  enters  into  the  transdetonative  operation  mode.  As  the  transdetonative 
mode  is  entered,  an  oblique  shock  appears  which  is  attached  at  the  bcise  of  the  pro¬ 
jectile.  Since  the  shock  strength  is  not  sufficient  to  render  the  flow  behind  it  subsonic 
over  the  entire  domain,  the  flow  behind  this  oblique  shock  is  a  mixed.  Part  of  the 
combustion  occurs  supersonically  and  the  rest  occurs  subsonically.  These  previous 
calculations,  however,  only  considered  the  bulk  combustion  processes  and  neglected 
the  effects  of  the  boundary-layer  interactions.  .An  objective  of  the  present  calculation 
was.  therefore,  to  analyze  the  importance  of  boundary  layers  in  the  ram  accelera¬ 
tor  flow  fields.  It  is  shown  that  the  ‘‘free- slip  wall”  assumption  can  only  be  used 
in  the  thermally  choked  calculations,  and  boundary-layer  interactions  are  extremely 
important  in  higher  velocity  operation  modes. 

The  geometry  considered  is  shown  in  Figure  37.  This  geometry  is  the  actual  ram 
accelerator  geometry  given  in  [78].  The  tube  diameter  is  38  mm.  The  projectile  has  a 
10‘'  nose  cone  half-angle,  a  forebody  length  of  82.06  mm,  a  body  length  of  71.1  mm. 
and  a  base  diameter  of  17.8  mm.  The  three-zone  computational  domain  consisted  of 
63  <  60.  81  <  60.  and  84  <  120.  mesh  points  clustered  near  the  projectile  wall  and 
the  tube  wall.  The  free-stream  mixture  is  -I- 2O2  +  5.0N2  31  atmospheres 

and  300‘'/\'.  The  Chaprnan-.Iouget  (CJ)  detonation  velocity  for  this  mixture  is  1770 
m/s  [78].  Caiculations  were  performed  under  a  laminar  flow  assumption.  A  converged 
.solution  was  usually  achieved  within  9.000  iterations  requiring  approximately  10  Cray 
'\'/MP  epu  hours/case.  The  domain  was  initialized  with  free  stream  conditions  with 
a  normal  shock  at  the  base  of  the  projectile.  Behind  the  shock,  the  pressure  was  set 
to  10  times  the  free  stream  value,  the  temperature  was  set  to  2500°/f,  and  the  rest  of 
the  li('ld  variables  were  set  to  free-stream  conditions.  The  boundary  conditions  were: 
no-slip  on  the  projectile  surface,  free-slip  on  the  tube  wall,  a  symmetry  plane  on  the 
center  line  behind  the  projectile,  and  supersonic  outflow.  The  combustion  model  for 
tilt'  :am  accelerator  calculations  is  the  quasi-global  methane-air  model  described  in 
the  prei'ious  sect  ion. 

.\  total  of  tt'ii  runs  were  made  to  demonstrate  the  thermally  choked  and  the 
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transdetonative  operation  modes  of  the  ram  accelerator.  The  ten  runs  are  listed  in 
Tai)le  1. 


Case  .No. 

Description 

f  x(m/s) 

UIVcj 

1. 

Thermally  (.’hoked 

1250 

0.71 

2 

Thermally  Choked 

1350 

0.76 

3. 

Transdetonative 

1470 

0.83 

4. 

Transdetonative 

1540 

0.87 

5. 

Transdetonat  i  ve 

1625 

0.92 

6. 

Transdetonative 

1685 

0.95 

, 

Transdetonative 

1760 

0.99 

s. 

Transdetonative 

1840 

1.04 

9. 

Transdetonative 

1965 

1.11 

10. 

Transdetonative 

2015 

1.14 

Table  I;  Ram  .\ccelerator  Cases 


The  computed  temperature  contours,  and  pressure  profiles  along  the  tube  wall  for 
:  he  ten  cases  are  shown  in  Figures  dS  through  47.  respectively.  !n  the  thermally  choked 
mmie.  one  can  see  clearly  the  series  of  oblique  waves  which  culminate  in  a  normal 
-hock.  This  normal  shock  is  stalulized  on  the  projectile's  body  by  the  massive  heat 
a.  id  It  ion  behind  the  projectile  and  the  thermal  choking  point  which  occurs  about 
i  orojectile  length  Ix-hind  the  liase.  It  is  important  to  note  that  the  location  of 
thi-  normal  shock  is  strongly  alfectcd  by  the  amount  of  heal  addition  behind  the 
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Figure  47;  Mcj  =  1.14  Ram  Accelerator 


base.  Thus,  it  depends  on  the  reaction  rates  used  in  the  combustion  model.  By 
using  different  combustion  models,  the  unstart  phenomenon  has  been  experienced  in 
which  the  normal  shock  and  combustion  front  move  upstream  of  the  throat,  and  the 
projectile  experiences  negative  thrust.  A  small  boundary  layer  combustion  region  also 
appears  following  the  normal  shock.  This  boundary  layer  combustion  is  suspected  to 
be  due  to  the  adverse  pressure  gradient  behind  the  shock,  which  causes  the  laminar 
boundary  laver  to  separate.  In  the  separation  region,  the  temperature  and  residence 
times  are  high  enough  to  initiate  combustion.  .A  similar  behavior  has  also  been 
observed  in  the  oblique  detonation  mode  calculations  of  Yungster  [68].  The  first 
dip  ill  the  pressure  profiles  corresponds  to  the  expansion  region  at  the  base  of  the 
projectile.  Behind  this  expansion  point,  the  pressure  increases  until  it  reaches  a  local 
maximum  where  the  combustion  front  reaches  the  tube  wall.  Downstream  of  this 
expansion  point,  the  flow  chokes  and  becomes  transonic.  The  two  thermally  choked 
calculations  appear  to  agree  with  the  experimental  results  observed  at  the  University 
of  Washington  [78].  The  base  region  acts  as  a  flame  holder  at  these  velocities  and 
there  is  no  combustion  in  the  boundarv  layer  upstream  of  the  normal  shock. 

In  the  transdetonative  operation  mode,  the  culminating  shock  is  no  longer  a  nor¬ 
mal  shock.  It  changes  its  characteristics  and  becomes  an  oblique  shock  wave.  .As 
the  velocitv  is  increased,  the  shock  recedes  along  the  body  of  the  projectile.  Since 
the  shock  is  no  longer  normal,  it  is  not  sufficient  to  render  the  flow  behind  it  sub¬ 
sonic  over  the  entire  tube  area.  The  flow  behind  this  oblique  shock  is  mixed  flow, 
where  a  full  combustion  is  reached  in  a  very  short  distance  in  the  subsonic  region 
but  much  slower  in  the  supersonic  part.  This  oblique  shock  is  followed  immediately 
by  a  combustion  front  which  starts  at  the  boundary  layer.  This  partial  combustion 
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Figure  48:  Thrust  Coefficient  of  the  Ram  Accelerator 


on  the  body  of  the  projectile  has  also  been  observed  experimentally.  As  the  veloc¬ 
ity  is  increased,  the  location  of  the  shock/deflagration  system  continues  to  recede 
along  the  body  until  it  reaches  the  projectile’s  b2ise.  This  point  coincides  with  the 
lowest  thrust  coefficient  which  occurs  at  U/Ucj  =  0-92.  In  contrast  to  the  inviscid 
results  [73].  beyond  this  velocity  the  shock/deflagration  front  now  moves  forward  on 
the  projectile's  body.  This  phenomenon  is  related  to  the  upward  sweep  in  the  thrust 
coefficient.  At  %-elocities  higher  than  1840  m/s,  combustion  occurs  prematurely  in  the 
boundary  layer  at  the  nose  region  of  the  projectile.  This  behavior  has  also  been  ob¬ 
served  in  the  oblique  detonation  mode  calculations  of  Yungster  [68]  and  experimental 
luminosity  data  [69.  78.  83].  In  contrast  to  Yungster’s  results  however,  the  full  tube 
combustion  is  not  reached  until  about  2  to  4  projectile’s  length  downstream.  The 
location  where  a  full  tube  combustion  begins  is  also  observed  as  the  peak  pressure 
in  the  tube  wall  pressure  profiles.  These  locations  (where  a  full  tube  combustion 
begins)  appear  to  be  further  back  than  the  experimental  data.  This  discrepancy 
is  probably  due  to  neglected  turbulence  and  three-dimensional  effects.  This  down- 
stream/upstream  movement  of  the  shock/ deflagration  front  has  also  been  observed 
by  Xusca  [81|.  However,  since  his  results  were  obtained  by  coupling  two  different 
codes  in  an  iterative  manner  (where  the  flow  field  is  partitioned  by  a  normal  shock), 
the  culminating  shock  was  always  a  normal  shock. 

The  computed  thrust  coefficients  for  different  velocities  are  compared  to  the  thrust 
coefficient  derived  from  the  experimental  observations  in  Figure  48.  The  computed 
thrust  coefficient  shows  very  good  agreement  with  the  experimental  data.  It  is  impor¬ 
tant  to  note  that,  after  some  careful  comparisons  to  the  experimental  data,  there  are 
two  discrepancies.  First,  the  pressure  profiles  adong  the  tube  wall  indicate  that  this 
shock/deflagration  system  has  a  very  high  peak  pressure  which  is  not  observed  in  the 
experimental  results.  This  phenomenon  may  be  due  to  simplified  free-slip  boundary 
conditions  applied  at  the  tube  wall.  Second,  the  first  pressure  peak  behind  the  throat 
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is  luu'cr  in  the  ccjinpntefi  results  than  that  observed  experimentalh'.  This  discre])an<  \' 
is  "huwn  in  the  lu'xt  section  to  be  dn<'  to  fin  interactions.  Fin  effects  miirlit  also 
a<  c(aint  for  nidications  in  the  Inminosity  data  that  there  is  combustion  at  tin'  thia^at. 

Results  Iroin  this  s('t  of  calculations  indicate  the  importance  of  the  boundai\ 
layer  ccjinlnist  icjii  m  the  t  ransdctonat  ive  o[)eration  mode  of  the  ram  accelerator,  ft 
als(j  gives  .i  pcjssible  explanation  of  the  second  upward  sweep  in  the  thrust  data. 
I'in.dly.  the  overall  result^  'how  a  \'(>ry  good  agreement  with  tin*  ex]n-riinent,d  ri'suii' 
and  !  lends  Iroiti  the  I  niwusity  id  Washington.  However,  there  are  several  aspeci.- 
neulected  in  t  his  calculation  which  .tr<'  suspected  to  play  a  significant  role  in  iniinerical 
siniuiations  ot  die  t ransdetoiiat i\-e  o[)eration  mode.  One  is  the  effect  of  tin'  hinhl'c 
swept  tins  which  will  generate  t hr<'e-dimensional  shock  waves  and  vortict's  ahnig  the 
projectile  bod'.'.  The  three-dimensional  shock  waves,  vortices,  and  Itoundary  layer 
ma_\'  interact  and  initiate  throat  cotnbustion  at  lower  velocities  than  prt'dicted  by  the 
a.\i'\'tninet ric  calculations.  I  he  t hre(‘-dimensional  shock  and  the  throat  combustion 
couid  filso  exjilain  the  under[>redict ion  of  tlie  first  pressure  peak  dowtistream  of  the 
t  liroiit . 


7.9  Ideal  Gat;  3-D  Rani  Accelerator 

1  he  fiii.d  tc't  case.  ,1  jierfect  uas  calculation  for  a  three-dimensional  ram  accelerator 
[Uojectile.  was  'elected  to  (diicidate  th<*  effects  of  fin  interactions.  The  projectile 
geometr\’  is  'iimlar  to  t  he  one  us('d  at  t  fie  f  niversity  of  W’ashington  [bid-  Howe\er.  to 
'imihiiiv  the  computation,  the  height  of  tlie  fin  is  truncated  slightlv  to  allow  a  3  tnm 
'Uap  bet  ween  the  fin  and  the  tube  wall.  The  gap  in  the  experiment  is  essiuit  ial!\-  zero 
1  aripioximaieiy  (I.O")  mmi.  Further,  a  laminar  flow  and  free  slip  walls  were  assumed, 
I  he  free  stream  coiiditioiis  were  P.  =  31  at inos|rliere.s.  =  300'' A’,  and  (\  =  balU 
m  1  he  caicuiai  loll  '.v.is  jierfoinied  witfi  tlie  10  •  20  ■-  20  singh'-zone  iiie'h  'howi, 
in  liu'ire  !'i,  I  h,.  eomimted  'tream-tnbes  !  set"  l-'ignre  19)  show  a  vortex  rollui)  aioiiu 
the  'ide  of  1  lie  fin,  h  iguri'  do  'hows  t  h('  tnfie  wail  pressure  profiles  at  three  diiferem 
orientation'  with  ic'pect  to  the  fin  d)  .Id'd  and  \rp).  .■\t  O'’  the  jump  in  the  fii'i 
pie-'iire  pe.iK  Aa'  .ibout  30  tinies  the  free-'tri'am  pres.snre.  This  peak  has  ne'.er  been 
'ei-n  m  the  axi'Vinmei  ne  ciilculai  ions  and  is  solely  dm*  to  the  fin  (‘ffeit.  I  in-  ;u:m> 


Figure  50;  Tube  Wall  Pressure  Profiles  at  U“.  15^.45“  angles  to  the  fin 

is  also  associated  with  a  three-dimensional  oblique  shock  reflection  which  is  shown 
in  the  45'^  pressure  trace  profile  as  a  second  pressure  peak.  In  the  corresponding 
a.xis'cmmetric  calculation,  the  tube  wall  pressure  profile  was  similar  to  the  one  at  45° 
without  the  second  pressure  peak. 

The  oblique  shock  wave  due  to  the  fin  interactions  changes  the  flow  features  n  the 
ram  accelerator  projectile  flow  fields.  It  also  sheds  light  on  the  discrepancies  in  the 
cotiiparison  between  the  experimental  pressure  profiles  and  results  from  axisymmet'’i^ 
calculations.  Unfortunately,  this  calculation  does  not  give  information  on  the  effeci.s  cf 
the  slucck  wave  on  the  boundary  layer  and  combustion  initiation.  Obviously,  the  wave 
will  raise  the  temperature  behind  it  significantly  which  may  be  enough  to  start  the 
comliustion  for  some  cases.  More  three-dimensional  calculations  should  be  performed 
itu  luding  boundary  layer  effects  and  finite-rate  combustion. 


Reaction 

c 

n 

Ejk 

1 

CO-  M 

== 

0  +  0  +  M 

4.5x10''^ 

-1.0 

128.900 

•> 

CO:  -  M 

= 

CO^O- M 

.4.7X10*-’ 

0 

52.500 

.4 

0:  -  M 

20  ^  M 

2.0x10^’ 

-1.5 

59.500 

4 

CO  -  CO 

CO:  -  C 

2.4x10'^^ 

0.5 

65.710 

5 

CO  -  0 

0:  -  C 
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0  :  -h  C 
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0 
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1 
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,s 
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0 

75.500 

1) 
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= 

.VO  +  C 

2.0x10" 

0.5 

53.630 

10 

.\b  0 

.VO  +  .V 

6.4x10'" 

-1.0 

38.370 

11 
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4.6x10'’'' 
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12 

.VO  ^  0 
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0:  +  y 

8.4x10'" 

0 

19.4.50 

Table  5;  Chemistry  model  for  .Mars  atmospheric  gases 


No. 

Reaction 

C 

1. 

H  +  02<=>  OH  +  0 

2.2E14 

0.0 

8455.0 

2. 

0  +  Hi  OH  +  H 

7.5E13 

0.0 

5586.0 

3. 

//2  +  Oi/  «  //  +  HiO 

2.0E13 

0.0 

2600.0 

4. 

OH  +  0H  ^0  +  HiO 

5.3E12 

0.0 

503.0 

5. 

Hi  +  M  ^2H  +  M 

5.5E18 

-1.0 

51987.0 

6. 

HiO  +  M  ^  OH  +  H  +  M 

5.2E21 

-1.5 

59386.0 

7. 

OH  M  ^  0  +  H  +  M 

8.5E18 

-1.0 

50830.0 

8. 

O2  +  M  0  -+-  0  +  M 

7.2E18 

-1.0 

59340.0 

Table  6:  Hydrogen-Air  Combustion  Mode) 


No. 

Reaction 

C 

n 

E/k 

C//4  +  M  4=>  CH3  +  H  +  M 

1.50E+17 

0.0 

44600.0 

CH4  +  H  CHz  +  Hi 

2.00E+14 

0.0 

5982.7 

CH4  +  OH  ^  CH3  +  HiO 

3.00E4-13 

0.0 

2513.7 

CH4  +  0  CHz  +  OH 

2.00E-fl3 

0.0 

3469.0 

CH3  +  0  CHiO  4-  H 

8.00E4-13 

0.0 

502.7 

6 

CHz  -f  O2  «  CHiO  +  OH 

4.00E+13 

0.0 

8798.1 

7 

CHiO  +  M  CO  +  Hi M 

2.OOE4-I6 

0.0 

17596.2 

8 

CHiO  +  OH  ^  CHO  4-  HiO 

2.50E-fl3 

0.0 

502.7 

9 

CHiO  4-  0  CHO  +  OH 

3.00E+13 

0.0 

0.0 

10 

CHiO  +  H  <=i>  CHO  +  Hi 

1.70E4-13 

0.0 

1508.2 

11 

CHO  -\-OH  ^  COi  +  H 

l.OOE+14 

0.0 

0.0 

12 

CHO  +  M  ^  CO  +  H  +  M 

2.OOE4-I2 

0.5 

14479.2 

13 

CO  +  OH  ^  COi  4-  H 

5.50E+11 

0.0 

543.0 

14 

H  +  Oi^  OH  ^0 

2.20E-fl4 

0.0 

8455.0 

15 

0  +  Hi  OH  +  H 

7.50E4-13 

0.0 

5586.0 

16 

OH  +  Hi^  HiO  +  H 

2.00E-fl3 

0.0 

2600.0 

17 

OH  +  OH  HiO  -f  0 

5.30E+12 

0.0 

503.0 

18 

Hi  +  M  ^  H  +  H  +  M 

5.5OE4-IS 

-1.0 

51987.0 

19 

HiO  +  M  ^  OH  +  H  +  M 

5.2OE4-21 

-1.5 

59386.0 

Table  7:  Methane- Air  “Fuir  Combustion  Model 
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No. 

Reaction 

C 

n 

E/k 

C//4  +  I.5O2  =i>  CO  +  2H2O 

4.00E+10 

0.0 

24333.0 

CO^-OH  CO2  +  H 

1.50E+07 

1.3 

-402.2 

CO  +  O2  s — v  CO2  +  0 

3.10E+11 

0.0 

18903.3 

CO  +  0  +  .\i  CO2  +  M 

5.90E+15 

0.0 

2061.3 

H  +  02^  OH  +  0 

2.20E+14 

0.0 

8446.2 

6 

0  +  H2  OH  +  H 

1.80E+10 

1.0 

4474.5 

7 

OH  +  H2  «  //2O  -b  H 

2.20E+13 

0.0 

2564.0 

8 

H2O  +  0  ^  OH  +  OH 

6.80E+13 

0.0 

9250.6 

9 

OH  +  M  0  +  H  +  M 

8.00E-fl9 

-1.0 

52135.0 

O2  +  M  ^0  +  0  + M 

5.10E+15 

0.0 

57816.1 

11 

H2  +  M  H  +  H  +  M 

2.20E+14 

0.0 

48263.8 

12 

H2O  +  M  ^  OH  +  H  M 

2.20E+16 

0.0 

52788.6 

Table  S;  Methane- Air  Quasi-Global  Combustion  Model 
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8  Evaluation  and  Comparisons  of  Parallel  Perfor¬ 
mance 

In  this  section,  timings  for  six  of  the  test  cases  described  in  the  previous  section  are 
presented.  From  these  timings,  the  parallel  performance  of  pHANA  can  be  estimated. 

Two  parameters  commonly  used  to  measure  the  parallel  performance  of  an  ap¬ 
plication  program  are  speedup  and  efficiency.  Speedup  is  defined  as  the  ratio  of  the 
time  required  to  run  a  particular  application  on  one  processor  to  that  required  to  run 
on  N  processors.  Efficiency  is  defined  as 

Efficiency  =  Speedup/ V, 

or  the  fraction  of  the  maximum  possible  speedup  obtainable  with  N  processors.  Am¬ 
dahl's  law  relation  is  often  used  to  determine  the  maximum  possible  performance  of 
an  algorithm.  Using  Amdahl’s  relation,  the  speedup  and  efficiency  are  given  by 


Speedup  = 


p  -f  (1  —  p)N 


=  pT(i-p)N 

where  p  is  the  fraction  of  the  calculation  that  is  performed  in  parallel.  Obviously, 
algorithms  that  can  give  p  =  1  arc  of  particular  interest.  With  such  algorithms  there  is 
no  inherent  limit  placed  on  the  speedup  by  the  algorithm  itself.  The  current  algorithm 
is  designed  to  achieve  nearly  100%  or  p  =  1  parallelism.  However,  in  practice,  there 
are  several  factors  that  limit  the  performance  of  the  current  implicit  algorithm: 

1.  load  balancing  (tib) 

2.  interprocess  communication/data  sw'apping  and  synchronization 

3.  additional  work  due  to  domain  decomposition  (<add/) 

4.  reduction  in  vector  length  (Uec<) 

5.  competition  for  cpu  time  among  users  (for  multi-user  systems)  (<ex«r) 
Therefore,  the  time  required  to  complete  a  calculation  on  N  processors  (<.v)  is 

+  tib  +  tc  +  iaddl  +  t  vect  'E  ^extr 

where  U  is  the  time  required  to  solve  the  same  problem  on  one  processor. 

Load  balancing  refers  to  the  even  distribution  of  the  computational  time  among 
the  processors.  For  our  implicit  algorithm,  it  can  be  eliminated  by  satisfying  a  simple 
formula: 

MOD  [{I D  + in,  -  l)*4],n.)  =  MOD([JD  +  {n,  -  l)*4],nj)  =  0. 
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where  ID.  JD  are  the  number  of  mesh  points  in  the  i-  and  j-directions,  and  n,  and 
n j  are  tiie  number  of  divisions  used  in  each  direction  to  decompose  the  domain.  In 
this  section,  the  load  balancing  problem  has  been  eliminated  in  all  cases  by  select¬ 
ing  combinations  for  the  domain  decomposition  such  that  perfect  load  balancing  is 
achieved. 

Interprocess  communication/data  swapping  and  synchronization  are  an  important 
consideration  in  the  design  of  parallel  algorithms.  Even  for  a  perfectly  parallel  algo¬ 
rithm.  the  total  computational  time  is  alwaj’s  augmented  by  the  time  to  perform  the 
necessary  communication  between  processors.  Ihe  communication  time  is  dependent 
on  the  hardware  bandwidth,  the  message  passing  software,  and  the  algorithm.  The 
ratio  of  the  time  required  for  message  pcissing  to  the  total  run  time  depends  on  the 
amount  uf  work  (number  crunching)  on  each  processor.  It  is  approximately  inversely 
proportional  to  the  square  root  of  the  number  of  mesh  points  assigned  to  each  pro¬ 
cessor  and  to  the  number  of  species  included  in  the  calculation.  For  a  distributed 
memory  system,  it  is  generally  advantageous  to  minimize  the  number  of  messages 
passed  and  to  maximize  the  size  of  each  message.  This  reduces  the  effect  of  latency 
time  (or  the  time  required  to  send  a  0-byte  message)  on  the  parallel  efficiency. 

The  domain  decomposition  technique  adopted  in  the  current  implicit  algorithm 
formulation  is  not  perfect.  One  of  its  drawbacks  is  that  additional  work  is  required 
as  more  processors  are  used  to  solve  a  problem  in  a  parallel  manner.  The  additional 
work  occurs  at  the  mesh  cell  faces  that  define  the  boundaries  between  subzones.  For 
a  single  processor  calculation  the  flux  through  each  face  must  only  be  calculated  on^-v. 
jx'r  time  step.  With  multiple  processors  it  must  be  calculated  twice;  once  for  each 
of  the  subzones  on  either  side  of  the  face.  Thus,  for  parallel  calculations,  the  added 
work  to  compute  these  flu.xes  is  approximately 

taddi  y-  [{JD*  {n,  -  1))  -f  (ID  *  (Uj  -  1))]. 


Of  course  this  work  is  only  involved  in  the  flux  and  implicit  flux  Jacobian  calcula¬ 
tions.  which  together  are  approximately  40%  of  the  total  work.  Therefore  the  final 
additional  work  due  to  subdomain  decomposition  is: 


taddi/h  -  0.40  * 


[  ID  JD 


.\  'J-rell  interproce.ss  communication  patch  was  chosen  over  a  1-cell  patch  which  would 
recpiire  half  as  many  additional  subzone  boundary  cells  due  to  domain  decomposition. 
However,  a  1-cell  patch  would  require  additional  calculations  and  information  to  be 
sent  for  the  explicit  and  implicit  fluxes  to  the  neighboring  processors,  which  is  less 
cost  effective.  It  is  obvious  that  for  a  constant  problem  size  the  additional  work  due 
to  parallelization  increases  with  increasing  number  of  processors.  The  best  parallel 
efficiency  is  obtained  by  solving  the  largest  possible  problem  that  will  fit  in  the  avail¬ 
able  memory  for  a  given  number  of  processors.  While  this  approach  will  optimize 
parallel  efficiency,  it  may  not  satisfy  the  most  common  goal,  which  is  to  minimize  the 
run  time  for  a  constant  size  problem. 
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The  reduction  in  vector  length  is  a  very  important  consideration  if  a  parallel  al¬ 
gorithm  is  to  be  used  on  a  parallel  vector  machine.  It  also  affects  performance  on 
machines  which  utilize  pipelining  architectures.  The  domain  decomposition  divides 
the  overall  domain  into  smaller  subdomains  which  results  in  shorter  vector  lengths. 
Even  without  any  other  factors  affecting  its  performance,  the  time  to  complete  a  cal¬ 
culation  increases  with  the  reduction  in  vector  length.  In  this  report,  vector  length 
reduction  is  especially  noticeable  in  the  Cray  Y-MP  performance,  and  will  be  ad¬ 
dressed  later  in  this  section. 

The  two  shared  memory  systems  tested  in  this  work  are  the  Cray  Y-MP  and  the 
SGI  4D.  Both  systems  are  multi-user/multi-tasking  systems  where  several  users  can 
run  their  specific  application  programs  simultaneously.  On  this  type  of  system,  un¬ 
fortunate!}',  the  presence  of  other  users  affects  the  performance  of  parallel  algorithms. 
Other  users  programs  will  unevenly  delay  the  execution  of  the  pHANA  calculations 
on  some  of  the  processors.  This  leads  to  an  uncontrollable  load  leveling  problem  as 
the  other  processors  sit  idle  while  the  delayed  processor  catches  up. 

In  the  following  subsections,  timing  results  for  three  parallel  computers  are  pre¬ 
sented. 


8.1  Distributed  Memory  MIMD 

The  parallel  HAN  A  code  (pH  AN  A)  was  tested  on  an  Intel  Touchstone  iPSC/860  for 
six  of  the  cases  described  in  section  7.  The  parallel  speedup  and  efficiency  for  these 
cases  are  shown  in  Figures  51  and  52.  The  efficiency  of  the  code  is  different  for  the 
different  cases.  .As  expected,  better  efficiencies  are  achieved  for  the  cases  with  larger 
meshes. 

Figure  5.3  shows  how  the  cpu  time  is  spent  for  the  axisymmetric  combusting  ram 
accelerator  case.  By  holding  the  number  of  mesh  cells- per- processor  constant,  the 
domain  decomposition  overhead  {taddi)  is  constant  and  the  overhead  due  to  reduction 
of  vector  length  for  pipelining  (Geci)  is  eliminated.  Thus,  the  reduction  in  efficiency 
as  the  number  of  processors  is  increased  is  attributable  only  to  the  interprocess  com¬ 
munication  overhead  (G)-  Note  that  for  this  case,  the  calculation  solves  14  sets  of 
partial  differential  conservation  equations  (10  species,  3  momentum,  and  1  energy). 
This  maximizes  the  amount  of  work  on  each  node.  Therefore,  the  ratio  of  commu¬ 
nication  overhead  to  the  total  amount  of  work  is  minimized,  resulting  in  a  very  high 
parallel  performance  (see  Figures  51  and  52).  It  appears  that  when  a  maximum 
amount  of  work  is  assigned  equally  to  all  epus.  only  25%  or  less  of  the  total  time 
is  spent  doing  ‘‘noncomputational”  activities  when  the  number  of  processors  is  less 
than  64. 

.As  depicted  in  Figure  51,  the  speedup  increases  as  the  amount  of  work  relative  to 
the  communication  time  overhead  increases.  Figure  54  illustrates  how  the  cpu  time  is 
spent  for  a  calculation  where  the  number  of  epus  is  increased  and  the  size  of  the  prob¬ 
lem  is  held  constant.  It  shows  the  distribution  of  cpu  time  for  the  hydrogen-air  shock 
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Figure  53:  iPSC/860:  Distribution  of  CPU  time  when  the  amount  of  work/node  is 
constant 


induced  combustion  case.  For  this  case,  increasing  the  number  of  processors  utilized 
in  the  calculation  increases  the  percentage  of  work  due  to  domain  decomposition  and 
decreases  the  working  vector  length.  To  isolate  the  interprocess  communication  over¬ 
head  time  {tc),  a  set  of  calculations  was  performed  varying  the  number  of  cpus  but 
holding  the  work  per  node  constant.  Also  the  effect  of  the  reduction  in  vector  length 
overhead  time  {tyect)  was  approximated  by  running  a  set  of  1-cpu  calculations  where 
the  dimension  of  the  problem  was  changed.  Unlike  the  axisymmetric  combusting  ram 
accelerator  case,  the  hydrogen-air  shock  induced  combustion  only  requires  11  partial 
differential  conservation  equations  (7  species,  3  momentum,  and  1  energy).  Therefore, 
each  subdomain  is  smaller  and  requires  less  work  than  that  for  the  ram  accelerator 
case.  This  results  in  lower  parallel  efficiency  and  a  higher  ratio  of  interprocess  com¬ 
munication  overhead  to  the  total  computational  time.  This  ratio  increases  linearly 
after  15  processors.  More  than  25%  of  the  total  cumulative  time  is  spent  performing 
“noncomputational”  activities  when  the  number  of  processors  exceeds  40. 

Finally,  the  floating  point  operations  per-second  (flops)  for  the  combusting  ram 
accelerator  case  are  given  in  Table  9.  The  flops  are  calculated  based  on  the  floating 
point  operation  count  provided  by  the  hardware  performance  monitor  on  the  Cray  Y- 
MP.  It  appears  that  at  least  48  processors  are  required  to  obtain  a  speed  comparable 
to  a  single  cpu  Cray  Y-MP  for  this  case. 
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Speedup 

Efficiency 

MFlops 

1 

1.0 

100.0 

3.00 

•5.70 

95.0 

17.1 

M 

12.88 

92.0 

.38.7 

dO 

26.71 

89.0 

80.17 

48 

38.40 

80.0 

115.3 

hO 

46.20 

76.9 

138.7 

ta'hli’  iPSC  ^(iO:  Floatiiiu  point  operations  per-second  for  the  rant  accelerator 


I'a.M’. 


Number  of  cpu 

Figure  55;  Speedup  on  Cray  Y-MP 


8.2  Shared  Memory  Cray  Y-MP 

The  parallel  speedup  and  efRciency  on  the  Cray  Y-MP  for  the  six  test  cases  are  shown 
in  Figures  55  and  56.  The  oerformance  of  the  code  appears  to  change  for  the  different 
cases.  The  best  performance  is  achieved  for  the  cases  with  the  longest  vector  lengths. 

Figure  57  .shows  how  the  cpu  time  is  spent  on  the  axisymmetric  combusting  ram 
accelerator  case.  In  this  case,  the  amount  of  work  and  the  vector  length  on  each 
rpu  is  held  constant.  The  reduction  of  efficiency  is  due  only  to  the  interprocess 
communication  overhead  (t^)  and  synchronization.  This  overhead  is  shown  to  be 
less  than  lO'X  for  S  cpus.  Unfortunately  the  same  result  does  not  apply  for  smaller 
cases  where  the  number  of  mesh  cells  is  held  fixed  while  the  number  of  processors  is 
increased. 

To  investigate  the  effect  of  varying  vector  length  on  the  performance  of  the  code, 
we  have  selected  two  sets  of  1-cpu  runs  where  each  dimension  is  changed  while  holding 
others  constant.  The  first  set  fixes  JD  =  11  while  varying  the  number  of  mesh  points 
in  the  I-direction.  and  the  second  fixes  ID  =  53  while  varying  the  number  of  mesh 
points  in  the  .J-direction.  Results  from  this  test  for  are  shown  in  Figure  58  for  the 
h}  drogen-air  shock  induced  combustion  case.  The  vector  length  is  a  very  important 
factor  in  the  achieved  speed  on  the  Cray  Y-MP.  This  result  is  crucial  for  understanding 
the  reduction  of  efficiency  of  a  parallel  algorithm  as  the  number  of  processors  is 
increased.  Using  the  domain  decomposition  technique,  a  computational  domain  is 
broken  up  into  smaller  subdomains  which  are  assigned  to  different  processors.  .A  good 


92 


ID  or  JO  Dimension 

Fig'M(‘  ')8;  Tray  \'-MP;  Effects  of  Vector  Length  Reductions 

parallel  performance  can  only  be  obtained  if  the  dimensions  of  these  subdomains  are 
large  enough  such  that  the  effect  of  vectorization  can  be  neglected.  For  our  test  cases, 
a  vector  length  of  at  least  50  is  required  for  this  to  be  true. 

Figure  59  illustrates  further  how  the  cpu  time  is  spent  for  a  calculation  in  which 
the  number  of  cpus  is  increased  and  the  size  of  the  problem  is  held  constant.  .\s  the 
number  of  processors  is  increased,  the  additional  work  due  to  domain  decomposition 
increases  and  the  vector  length  of  each  subdomain  decreases.  The  overhead  time 
associated  with  reduction  in  vector  length  (tvect)  is  approximated  by  running  a  set  of  1- 
cpu  calculations  where  the  dimension  of  the  problem  was  changed  (see.  Figure  58).  .\s 
mentioned  previously,  the  vector  length  in  this  case  is  much  shorter  than  is  optimum 
and  the  run.  therefore,  results  in  lower  parallel  efficiency.  The  ratio  of  overhead  time 
due  to  vectorization  loss  to  the  total  computational  time  is  increased.  This  represents 
more  than  JO'/f  of  the  total  cpu  time  for  a  constant-size  parallel  run  with  eight  cpus. 

Finally  the  floating  point  operations  per  second  (flops)  measured  for  the  combust¬ 
ing  ram  accelerator  case  are  given  in  Table  10.  This  performance  can  be  improved  by- 
increasing  the  number  of  mesh  cells  in  the  calculation. 

8.3  Shared  Memory  Silicon  Graphics  Iris 

The  parallel  speedup  and  efficiency  on  the  SGI  4D/380  for  the  six  test  cases  are  shown 
in  Figures  60  and  61.  .As  for  the  other  two  computers,  the  performance  of  the  code  is 
different  for  different  problems.  The  performance  appears  to  be  excellent  for  four  or 
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Figuic  60;  Speedup  on  SGI  4-D/380 
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Figure  61;  Efficiency  on  SGI  4-D/380 
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Figure  62:  SGI  4-D/380:  Distribution  of  CPU  time  when  the  amount  of  work/node 
is  constant 

less  processors.  However,  when  a  larger  number  of  cpu’s  are  used,  the  performance 
drops  significantly.  As  mentioned  previously,  the  load  balancing  problem  has  been 
eliminated  for  our  test  cases.  However,  in  a  multi-user  system  one  cpu  may  be  working 
on  two  or  more  applications  at  a  time.  Multiple  applications  running  at  the  same 
time  resulting  in  cpu  competition  which  causes  some  cpus  to  lag  behind  the  others. 
The  faster  cpus  must  then  wait  for  the  slower  ones  to  catch  up.  A  second  factor  is  the 
relatively  low  bandwidth  on  the  SGI  bus.  The  bandwidth  is  too  small  and  cpus  often 
have  to  wait  to  access  memor\‘  when  more  than  four  cpus  are  working  on  a  problem. 

Unlike  the  iPSC/S60  and  the  Cray  Y-MP  discussed  previously,  the  best  perfor¬ 
mance  on  the  SGI  machine  was  not  achieved  for  the  largest  test  case.  This  is  due  to 
the  smaller  cache  memory  available  on  the  machine.  The  cache  was  not  large  enough 
for  the  combusting  ram  accelerator  czise  to  run  without  significant  cache  misses. 

Figure  62  shows  how  the  cpu  time  was  spent  on  the  axisymmetric  combusting 
ram  accelerator  case.  In  this  case,  the  amount  of  work  on  each  cpu  is  constant.  The 
reduction  of  efficiency  is  due  only  to  the  interprocess  communication  overhead  (tc), 
synchronization,  competition  with  other  applications  for  the  cpus.  and  the  low  bus 
bandwidth.  This  overhead  is  shown  to  be  more  than  25%  for  8  cpus. 

Figure  63  illustrates  further  how  the  cpu  time  is  spent  for  the  constant-size  prob¬ 
lem.  .\s  the  number  of  processors  is  increased,  the  additional  work  due  to  domain  de¬ 
composition  increases.  The  overhead  time  associated  with  reduction  in  vector  length 
iti.ect)  is  seen  to  be  minimal  since  the  SGI  is  not  a  vector  machine.  The  amount  of 
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Figure  (hi.  S(!l  1-D/3S0;  Distribution  of  CPU  time  for  constant  problem  size 

time  recpiired  for  communication  and  synchronization  appears  to  be  less  than  that 
for  tile  cornimsting  ram  accelerator  case.  The  cpu  time  spent  on  noncomputaiiona) 
activitii's  is  fairly  low  when  less  than  4  processors  are  utilized.  However,  this  overhead 
time  jumps  significantly  for  higher  number  of  epus  due  to  cpu  competition  and  the 
inability  of  the  bus  to  handle  the  traffic. 

The  floating  points  operations  per  second  (flops)  measured  for  the  combusting 
ram  accelerator  cast'  are  given  in  Table  11.  The  flops  are  calculated  based  on  the 
number  of  floating  points  given  by  hardware  performance  monitor  on  Cray  V-MP. 


.Number  of  Procc.ssors 

Speedup 

Efficiency 

MFlops 

1 

1.0 

100.0 

2.69 

2 

1.98 

98.9 

5.33 
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3.72 

93.0 

10.01 
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4.84 

80.7 

13.02 

8 

.0.56 

69.5 

14.96 

Tabh-  1 1;  SCI  4-D/3S0:  Floating  point  operations  per  second  for  the  ram  accelerator 
case. 
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9  Conclusions 

As  a  result  of  this  investigation  we  can  make  the  following  conclusions. 

•  A  model  code  has  been  written  which  solves  a  set  of  Laplace  equations  on  a  mul¬ 
tiple  zone  mesh  using  point  Gauss-Seidel  relaxation  and  a  diagonal  wavefront 
algorithm.  The  code  was  written  to  simulate  the  operation  of  a  Navier-Stokes 
code  as  closely  as  possible.  The  model  code  was  parallelized  on  the  Cray  Y-MP, 
Intel  Touchstone  iPSC/860.  and  Silicon  Graphics  Iris  4-D/.380  (SGI)  machines. 

•  The  3D  Navier-Stokes  code  was  parallelized  using  domain  decomposition  on 
the  Cray  Y-MP,  Intel  Touchstone  iPSC/860,  and  SGI  4-D/3S0  computers.  The 
governing  equations  and  solution  algorithm  for  this  code,  pHANA.  are  detailed 
in  sections  4  and  5. 

•  Flow  field  and  parallel  performance  results  are  given  for  the  calculations  on  the 
three  computers.  The  test  cases  include  two  and  three-dimensional  flows  at 
subsonic,  transonic,  and  supersonic  speeds.  Thermochemical  models  vary  from 
nonreacting  ideal  gas  to  combusting  gases. 

•  The  main  deterrent  to  parallel  performance  on  the  Intel  iPSC/860  was  inter¬ 
process  communication.  Improvements  to  the  hardware  which  reduce  the  in¬ 
terprocessor  communication  latency  should  dramatically  improve  the  parallel 
performance  of  the  Nai  ier-Stokes  code  particularly  for  cases  which  can  not  be 
sized  to  fully  utilize  the  memory  available  with  each  cpu. 

•  The  main  deterrent  to  parallel  performance  on  the  CRAY  Y-MP  was  the  re¬ 
duction  in  vector  length  due  to  domain  decomposition.  Vector  length  was  not 
as  critical  to  performance  on  the  other  two  computers. 

•  The  main  deterrent  to  parallel  performance  on  the  SGI  4-D/3S0  w'as  memory 
contention,  low  bus  bandwidth,  and  competition  for  CPU  time  with  other  pro¬ 
cesses. 

•  For  all  computers,  increasing  the  size  of  the  problem  as  the  number  of  processors 
increased  yielded  a  better  performance  than  holding  the  problem  size  constant 
as  the  number  of  processors  increased. 
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